July 19, 2016

Life of a Librarian

How not to work during a sabbatical

This presentation — given at Code4Lib Midwest (Chicago, July 14, 2016) — outlines the various software systems I wrote during my recent tenure as an adjunct faculty member at the University of Notre Dame. (This presentation is also available as a one-page PDF handout designed to be duplex printed and folded in half as if it were a booklet.)

  • How rare is rare? – In an effort to determine the “rarity” of items in the Catholic Portal, I programmatically searched WorldCat for specific items, counted the number of times each item was held by libraries in the United States, and recorded the list of holding libraries. Through the process I learned that most of the items in the Catholic Portal are “rare”, but I also learned that “rarity” can be defined as the triangulation of scarcity, demand, and value. Thus the “rare” things may not be rare at all.
  • Image processing – By exploiting the features and functions of an open source library called OpenCV, I started exploring ways to evaluate images in the same way I have been evaluating texts. By counting & tabulating the pixels in an image it is possible to create ratios of colors, do facial recognition, or analyze geometric composition. Through these processes it may be possible to supplement art history and criticism. For example, one might be able to ask things like, “Show me all of the paintings from Picasso’s Rose Period.”
  • Library of Congress Name Authorities – Given about 125,000 MARC authority records, I wrote an application that searched the Library of Congress (LOC) Name Authority File and updated the local authority records with LOC identifiers, thus making the local authority database more consistent. For items that needed disambiguation, I created a large set of simple button-based forms allowing librarians to choose the most correct name.
  • MARC record enrichment – Given about 500,000 MARC records describing ebooks, I wrote a program that found the richest OCLC record in WorldCat and then merged the found record with the local record. Ultimately the local records included more access points and thus proved to be more useful in a library catalog setting.
  • OAI-PMH processing – I finally got my brain around the process of harvesting & indexing OAI-PMH content into VUFind. Whoever wrote the original OAI-PMH applications for VUFind did a very good job, but there is a definite workflow to the process. Now that I understand the workflow it is relatively easy to ingest metadata from things like ContentDM, but issues with the way Dublin Core is implemented still make the process challenging.
  • EEBO/TCP – Given the most beautiful TEI mark-up I’ve ever seen, I have systematically harvested the Early English Books Online (EEBO) content from the Text Creation Partnership (TCP) and done some broad & deep, but also generic, text analysis against subsets of the collection. Readers are able to search the collection for items of interest, save the full text to their own space for analysis, and have a number of rudimentary reports done against the result. This process allows the reader to see the corpus from a “distance”. Very similar work has been done against subsets of content from JSTOR as well as the HathiTrust.
  • VIAF Lookup – Given about 100,000 MARC authority records, I wrote a program to search VIAF for the most appropriate identifier and associate it with the given record. Through the process I learned two things: 1) how to exploit the VIAF API, and 2) how to exploit the Levenshtein algorithm. Using the latter I was able to make automated and “intelligent” choices when it came to name disambiguation. In the end, I was able to accurately associate more than 80% of the authority names with VIAF identifiers.

My tenure as an adjunct faculty member was very much akin to a one-year education, except for a fifty-five-year-old. I did many of the things college students do: go to class, attend sporting events, go on road trips, make friends, go to parties, go home for the holidays, write papers, give oral presentations, eat too much, drink too much, etc. Besides the software systems outlined above, I gave four or five professional presentations, attended & helped coordinate five or six professional meetings, taught an online, semester-long, graduate-level class on the topic of XML, took many different classes (painting, sketching, dance, & language) many times, lived many months in Chicago, Philadelphia, and Rome, visited more than two dozen European cities, painted about fifty paintings, bound & filled about two dozen hand-made books, and took about three thousand photographs. The only thing I didn’t do was take tests.

by Eric Lease Morgan at July 19, 2016 02:43 PM

June 03, 2016

Mini-musings

Achieving perfection

Through the use of the Levenshtein algorithm, I am achieving perfection when it comes to searching VIAF. Well, almost.

I am making significant progress with VIAF Finder [0], and I have now exploited the Levenshtein algorithm as well. In fact, I believe I am now able to programmatically choose VIAF identifiers for 50 or 60 percent of the authority records.

The Levenshtein algorithm measures the “distance” between two strings. [1] This distance is really the number of keystrokes necessary to change one string into another. For example, the distance between “eric” and “erik” is 1. Similarly, the distance between “Stefano B” and “Stefano B.” is also 1. Along with a colleague (Stefano Bargioni), I took a long, hard look at the source code of an OpenRefine reconciliation service which uses VIAF as the backend database. [2] The code included the calculation of a ratio denoting the relative similarity of two strings. This ratio is the length of the longer string minus the Levenshtein distance, divided by the length of the longer string. From the first example, the distance is 1 and the length of the string “eric” is 4, thus the ratio is (4 – 1) / 4, which equals 0.75. In other words, 75% of the characters are correct. In the second example, “Stefano B.” is 10 characters long, and thus the ratio is (10 – 1) / 10, which equals 0.9. In other words, the second pair is a closer match than the first.
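
To make the calculation concrete, here is a minimal sketch of the ratio written in Perl, assuming the Text::Levenshtein module (my choice; any implementation of the algorithm will do):

  #!/usr/bin/perl

  # ratio.pl - a sketch of the similarity ratio described above
  use strict;
  use warnings;
  use Text::Levenshtein qw( distance );
  use List::Util qw( max );

  sub ratio {
      my ( $query, $candidate ) = @_;
      my $length   = max( length( $query ), length( $candidate ) );
      my $distance = distance( $query, $candidate );
      return ( $length - $distance ) / $length;
  }

  # the two examples from the text
  printf "%.2f\n", ratio( 'eric', 'erik' );             # 0.75
  printf "%.2f\n", ratio( 'Stefano B', 'Stefano B.' );  # 0.90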

Using the value of MARC 1xx$a from each authority record, I can then query VIAF. The SRU interface returns 0 or more hits. I can then compare my search string with the search results to create a ranked list of choices. Based on this ranking, I am able to more intelligently choose VIAF identifiers. For example, from my debugging output, if I get 0 hits, then I do nothing:

       query: Lucariello, Donato
        hits: 0

If I get too many hits, then I still do nothing:

       query: Lucas Lucas, Ramón
        hits: 18
     warning: search results out of bounds; consider increasing MAX

If I get 1 hit, then I automatically save the result, which seems to be correct/accurate most of the time, even though the Levenshtein distance may be large:

       query: Lucaites, John Louis
        hits: 1
       score: 0.250     John Lucaites (57801579)
      action: perfection achieved (updated name and id)

If I get many hits, and one of them exactly matches my query, then I “achieved perfection” and I save the identifier:

       query: Lucas, John Randolph
        hits: 3
       score: 1.000     Lucas, John Randolph (248129560)
       score: 0.650     Lucas, John R. 1929- (98019197)
       score: 0.500     Lucas, J. R. 1929- (2610145857009722920913)
      action: perfection achieved (updated name and id)

If I get many hits, and many of them are exact matches, then I simply use the first one (even though it might not be the “best” one):

       query: Lucifer Calaritanus
        hits: 5
       score: 1.000     Lucifer Calaritanus (189238587)
       score: 1.000     Lucifer Calaritanus (187743694)
       score: 0.633     Luciferus Calaritanus -ca. 370 (1570145857019022921123)
       score: 0.514     Lucifer Calaritanus gest. 370 n. Chr. (798145857991023021603)
       score: 0.417     Lucifer, Bp. of Cagliari, d. ca. 370 (64799542)
      action: perfection achieved (updated name and id)

If I get many hits, and none of them are perfect, but the ratio is above a configured threshold (0.949), then that is good enough for me (even if the selected record is not the “best” one):

       query: Palanque, Jean-Remy
        hits: 5
       score: 0.950     Palanque, Jean-Rémy (106963448)
       score: 0.692     Palanque, Jean-Rémy, 1898- (46765569)
       score: 0.667     Palanque, Jean Rémy, 1898- (165029580)
       score: 0.514     Palanque, J. R. (Jean-Rémy), n. 1898 (316408095)
       score: 0.190     Marrou-Davenson, Henri-Irénée, 1904-1977 (2473942)
      action: perfection achieved (updated name and id)

By exploiting the Levenshtein algorithm, and by learning from the good work of others, I have been able to programmatically select VIAF identifiers for more than half of my authority records. When one has as many as 120,000 records to process, this is a good thing. Moreover, this use of the Levenshtein algorithm seems to produce more complete results when compared to the VIAF AutoSuggest API. AutoSuggest identified approximately 20 percent of my VIAF identifiers, while my Levenshtein-based logic identified more than 40 or 50 percent. AutoSuggest is much faster though. Much.
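
Put together, the decision logic illustrated above might be sketched like the following. The 0.949 threshold comes from the narrative; the value of MAX and everything else is a simplification of my own, not the production code:

  #!/usr/bin/perl

  # choose.pl - a sketch of the selection logic described above
  use strict;
  use warnings;

  my $MAX       = 10;      # maximum number of hits to consider (an assumption)
  my $THRESHOLD = 0.949;   # minimum acceptable ratio

  # given a query and a list of [ ratio, name, viafid ] triples, return
  # the chosen triple, or undef when no choice can be made
  sub choose {
      my ( $query, @hits ) = @_;
      return undef      if @hits == 0;     # zero hits; do nothing
      return undef      if @hits > $MAX;   # too many hits; do nothing
      return $hits[ 0 ] if @hits == 1;     # a single hit; trust it
      @hits = sort { $b->[ 0 ] <=> $a->[ 0 ] } @hits;   # best ratio first
      return $hits[ 0 ] if $hits[ 0 ][ 0 ] >= $THRESHOLD;
      return undef;                        # nothing is good enough
  }

  # for example, the "Lucas, John Randolph" case from above
  my $choice = choose(
      'Lucas, John Randolph',
      [ 1.000, 'Lucas, John Randolph', '248129560' ],
      [ 0.650, 'Lucas, John R. 1929-', '98019197' ],
      [ 0.500, 'Lucas, J. R. 1929-',   '2610145857009722920913' ]
  );
  print $choice ? "$choice->[1] ($choice->[2])\n" : "no choice made\n";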

Fun with the intelligent use of computers, and think of the possibilities.

[0] VIAF Finder – http://infomotions.com/blog/2016/05/viaf-finder/

[1] Levenshtein – http://bit.ly/1Wz3qZC

[2] reconciliation service – https://github.com/codeforkjeff/refine_viaf

by Eric Lease Morgan at June 03, 2016 09:48 AM

May 31, 2016

Catholic Portal

Catholic Pamphlets and the Catholic Portal: An evolution in librarianship

This blog posting outlines, describes, and demonstrates how a set of Catholic pamphlets was digitized, indexed, and made accessible through the Catholic Portal. In the end it advocates an evolution in librarianship.

A few years ago, a fledgling Catholic pamphlets digitization process was embarked upon. [1] In summary, a number of different library departments were brought together, a workflow was discussed, timelines were constructed, and in the end approximately one third of the collection was digitized. The MARC records pointing to the physical manifestations of the pamphlets were enhanced with URLs pointing to their digital surrogates and made accessible through the library catalog. [2] These records were also denoted as being destined for the Catholic Portal by adding a value of CRRA to a local note. Consequently, each of the Catholic Pamphlet records also made its way to the Portal. [3]

Because the pamphlets have been digitized, and because the digitized versions of the pamphlets can be transformed into plain text files using optical character recognition, it is possible to provide enhanced services against this collection, namely, text mining services. Text mining is a digital humanities application rooted in the counting and tabulation of words. By counting and tabulating the words (and phrases) in one or more texts, it is possible to “read” the texts and gain a quick & dirty understanding of their content. Probably the oldest form of text mining is the concordance, and each of the digitized pamphlets in the Portal is associated with a concordance interface.
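
For the curious, a rudimentary keyword-in-context (KWIC) concordance is not difficult to implement. Here is a minimal sketch of my own, not the Portal’s interface; it assumes a plain text file and a phrase given on the command line:

  #!/usr/bin/perl

  # concordance.pl - a minimal keyword-in-context (KWIC) concordance
  use strict;
  use warnings;

  my ( $file, $phrase ) = @ARGV;   # e.g. ./concordance.pl pamphlet.txt 'pope is'
  my $window = 40;                 # characters of context on either side (an assumption)

  # slurp the plain text and normalize whitespace
  open my $fh, '<', $file or die "Can't open $file: $!";
  my $text = do { local $/; <$fh> };
  close $fh;
  $text =~ s/\s+/ /g;

  # find each occurrence of the phrase and display it in context
  while ( $text =~ /\Q$phrase\E/gi ) {
      my $end   = pos( $text );
      my $start = $end - length( $phrase );
      my $from  = $start - $window;
      $from = 0 if $from < 0;
      print '...', substr( $text, $from, $end + $window - $from ), "...\n";
  }

Fed the OCRed text of a pamphlet and a phrase, a script like this outputs sentence fragments much like the one quoted in the next paragraph.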

For example, the reader can search the Portal for something like “is the pope always right”, and the result ought to return a pointer to a pamphlet named Is the Pope always right? of papal infallibility. [4] Upon closer examination, the reader can download a PDF version of the pamphlet as well as use a concordance against it. [5, 6] Through the use of the concordance the reader can see that the words church, bill, charlie, father, and catholic are the most frequently used, and by searching the concordance for the phrase “pope is”, the reader gets a single sentence fragment in the result, “…ctrine does not declare that the Pope is the subject of divine inspiration by wh…” And upon further investigation, the reader can see this phrase is used about 80% of the way through the pamphlet.

The process of digitizing library materials is very much like the workflow of the medieval scriptorium, and the process is well understood. Description of and access to digital versions of original materials are well accommodated by the exploitation of MARC records. The next step for the profession is to move beyond find & get and towards use & understand. Many people can find many things with relative ease. The next step for librarianship is to provide services against the things readers find so they can more easily learn & comprehend. Save the time of the reader. The integration of the University of Notre Dame’s Hesburgh Libraries’ Catholic Pamphlets Collection into the Catholic Portal is one possible example of how this evolutionary process can be implemented.

Links

[1] digitization process – http://blogs.nd.edu/emorgan/2012/03/pamphlets/

[2] library catalog – http://bit.ly/sw1JH8

[3] Catholic Portal – http://bit.ly/cathholicpamphlets

[4] “Of Papal Infallibility” – http://www.catholicresearch.net/vufind/Record/undmarc_003078072

[5] PDF version – http://repository.library.nd.edu/view/45/743445.pdf

[6] concordance interface – https://concordance.library.nd.edu/app/concordance/?id=743445

by Eric Lease Morgan at May 31, 2016 12:34 PM

May 27, 2016

Mini-musings

VIAF Finder

This posting describes VIAF Finder. In short, given the values from MARC fields 1xx$a, VIAF Finder will try to find and record a VIAF identifier. [0] This identifier, in turn, can be used to facilitate linked data services against authority and bibliographic data.

Quick start

Here is the way to quickly get started:

  1. download and uncompress the distribution to your Unix-ish (Linux or Macintosh) computer [1]
  2. put a file of MARC records named authority.mrc in the ./etc directory, and the file name is VERY important
  3. from the root of the distribution, run ./bin/build.sh

VIAF Finder will then commence to:

  1. create a “database” from the MARC records, and save the result in ./etc/authority.db
  2. use the VIAF API (specifically the AutoSuggest interface) to identify VIAF numbers for each record in your database, and if numbers are identified, then the database will be updated accordingly [2] (see the sketch after this list)
  3. repeat Step #2 but through the use of the SRU interface
  4. repeat Step #3 but limiting searches to authority records from the Vatican
  5. repeat Step #3 but limiting searches to the authority named ICCU
  6. done
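
For the curious, a rudimentary AutoSuggest query, in the spirit of search-suggest.pl, might look like the sketch below; the endpoint and the JSON keys (result, term, viafid) are my assumptions and ought to be verified against the VIAF API documentation [2]:

  #!/usr/bin/perl

  # suggest.pl - a sketch of querying the VIAF AutoSuggest interface
  use strict;
  use warnings;
  use HTTP::Tiny;
  use URI::Escape qw( uri_escape_utf8 );
  use JSON::PP qw( decode_json );

  my $name = shift @ARGV || 'Lucas, John Randolph';
  my $url  = 'http://viaf.org/viaf/AutoSuggest?query=' . uri_escape_utf8( $name );

  # submit the query and parse the JSON response
  my $response = HTTP::Tiny->new->get( $url );
  die "No response from VIAF\n" unless $response->{ success };
  my $data = decode_json( $response->{ content } );

  # list each suggested name and its VIAF identifier
  binmode STDOUT, ':utf8';
  foreach my $hit ( @{ $data->{ result } || [] } ) {
      print $hit->{ term }, ' (', $hit->{ viafid }, ")\n";
  }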

Once done the reader is expected to programmatically loop through ./etc/authority.db to update the 024 fields of their MARC authority data.
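
That last step might look something like the following sketch. It is not a part of the distribution (bin/subfield0to240.pl plays a similar role; see the manifest below), the column positions come from the Database section below, and the shape of the 024 field (first indicator 7, subfield $2 set to “viaf”) is a local decision and my assumption:

  #!/usr/bin/perl

  # update-024.pl - a sketch of adding 024 fields to authority records
  # using the VIAF identifiers recorded in ./etc/authority.db
  use strict;
  use warnings;
  use MARC::Batch;
  use MARC::Field;

  # read the database into memory; identifier (001) -> VIAF identifier
  my %viafids = ();
  open my $db, '<', './etc/authority.db' or die "Can't open database: $!";
  while ( <$db> ) {
      chomp;
      my @fields = split /\t/, $_, -1;
      $viafids{ $fields[ 0 ] } = $fields[ 13 ] if $fields[ 13 ];
  }
  close $db;

  # loop through the authority records, adding an 024 when a VIAF
  # identifier was found, and write the result to STDOUT
  binmode STDOUT, ':utf8';
  my $batch = MARC::Batch->new( 'USMARC', './etc/authority.mrc' );
  while ( my $record = $batch->next ) {
      my $field001 = $record->field( '001' ) or next;
      if ( my $viafid = $viafids{ $field001->data } ) {
          $record->append_fields( MARC::Field->new( '024', '7', ' ', a => $viafid, 2 => 'viaf' ) );
      }
      print $record->as_usmarc;
  }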

Manifest

Here is a listing of the VIAF Finder distribution:

  • 00-readme.txt – this file
  • bin/build.sh – “One script to rule them all”
  • bin/initialize.pl – reads MARC records and creates a simple “database”
  • bin/make-dist.sh – used to create a distribution of this system
  • bin/search-simple.pl – rudimentary use of the SRU interface to query VIAF
  • bin/search-suggest.pl – rudimentary use of the AutoSuggest interface to query VIAF
  • bin/subfield0to240.pl – sort of demonstrates how to update MARC records with 024 fields
  • bin/truncate.pl – extracts the first n number of MARC records from a set of MARC records, and useful for creating smaller, sample-sized datasets
  • etc – the place where the reader is expected to save their MARC files, and where the database will (eventually) reside
  • lib/subroutines.pl – a tiny set of… subroutines used to read and write against the database

Usage

If the reader hasn’t figured it out already, in order to use VIAF Finder, the Unix-ish computer needs to have Perl and various Perl modules — most notably, MARC::Batch — installed.

If the reader puts a file named authority.mrc in the ./etc directory, and then runs ./bin/build.sh, then the system ought to run as expected. A set of 100,000 records over a wireless network connection will finish processing in a matter of many hours, if not the better part of a day. Speed will be increased over a wired network, obviously.

But in reality, most people will not want to run the system out of the box. Instead, each of the individual tools will need to be run individually. Here’s how:

  1. save a file of MARC (authority) records anywhere on your file system
  2. not recommended, but optionally edit the value of DB in bin/initialize.pl
  3. run ./bin/initialize.pl feeding it the name of your MARC file, as per Step #1
  4. if you edited the value of DB (Step #2), then edit the value of DB in bin/search-suggest.pl, and then run ./bin/search-suggest.pl
  5. if you want to possibly find more VIAF identifiers, then repeat Step #4 but with ./bin/search-simple.pl and with the “simple” command-line option
  6. optionally repeat Step #5, but this time use the “named” command-line option; the possible named values are documented as a part of the VIAF API (e.g., “bav” denotes the Vatican)
  7. optionally repeat Step #6, but with other “named” values
  8. optionally repeat Step #7 until you get tired
  9. once you get this far, you may want to edit bin/build.sh, specifically configuring the value of MARC, and run the whole thing again — “one script to rule them all”
  10. done

A word of caution is now in order. VIAF Finder reads & writes to its local database. To do so it slurps the whole thing into RAM, updates things as processing continues, and periodically dumps the whole thing to disk just in case things go awry. Consequently, if you want to terminate the program prematurely, try to do so a few steps after the value of “count” has reached the maximum (500 by default). A few times I quit the application at the wrong moment and blew my whole database away. This is the cost of having a “simple” database implementation.

To do

Alas, search-simple.pl contains a memory leak. Search-simple.pl makes use of the SRU interface to VIAF, and my SRU queries return XML results. Search-simple.pl then uses the venerable XML::XPath Perl module to read the results. Well, after a few hundred queries the totality of my computer’s RAM is taken up, and the script fails. One work-around would be to request that the SRU interface return a different data structure. Another solution is to figure out how to destroy the XML::XPath object. Incidentally, because of this memory leak, the integer fed to search-simple.pl was implemented, allowing the reader to restart the process at a different point in the dataset. Hacky.

Database

The use of the database is key to the implementation of this system, and the database is really a simple tab-delimited table with the following columns:

  1. id (MARC 001)
  2. tag (MARC field name)
  3. _1xx (MARC 1xx)
  4. a (MARC 1xx$a)
  5. b (MARC 1xx$b and usually empty)
  6. c (MARC 1xx$c and usually empty)
  7. d (MARC 1xx$d and usually empty)
  8. l (MARC 1xx$l and usually empty)
  9. n (MARC 1xx$n and usually empty)
  10. p (MARC 1xx$p and usually empty)
  11. t (MARC 1xx$t and usually empty)
  12. x (MARC 1xx$x and usually empty)
  13. suggestions (a possible sublist of names, Levenshtein scores, and VIAF identifiers)
  14. viafid (selected VIAF identifier)
  15. name (authorized name from the VIAF record)

Most of the fields will be empty, especially fields b through x. The intention is/was to use these fields to enhance or limit SRU queries. Field #13 (suggestions) is for future, possible use. Field #14 is key, literally. Field #15 is a possible replacement for MARC 1xx$a. Field #15 can also be used as a sort of sanity check against the search results. “Did VIAF Finder really identify the correct record?”

Consider pouring the database into your favorite text editor, spreadsheet, database, or statistical analysis application for further investigation. For example, write a report against the database allowing the reader to see the details of the local authority record as well as the authority data in VIAF. Alternatively, open the database in OpenRefine in order to count & tabulate variations of the data it contains. [4] Your eyes will widen, I assure you.
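
As a trivial example, a report listing the local heading beside the chosen VIAF name and identifier might be as simple as this sketch (the column positions are the ones enumerated above):

  #!/usr/bin/perl

  # report.pl - a trivial report against ./etc/authority.db listing the
  # local heading (1xx$a) beside the chosen VIAF name and identifier
  use strict;
  use warnings;

  open my $db, '<', './etc/authority.db' or die "Can't open database: $!";
  while ( <$db> ) {
      chomp;
      my @fields = split /\t/, $_, -1;
      next unless $fields[ 13 ];   # skip records with no VIAF identifier
      # id, local heading, VIAF name, VIAF identifier
      print join( "\t", $fields[ 0 ], $fields[ 3 ], $fields[ 14 ], $fields[ 13 ] ), "\n";
  }
  close $db;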

Commentary

First, this system was written during my “artist’s education adventure” which included a three-month stint in Rome. More specifically, this system was written for the good folks at Pontificia Università della Santa Croce. “Thank you, Stefano Bargioni, for the opportunity, and we did some very good collaborative work.”

Second, I first wrote search-simple.pl (SRU interface) and I was able to find VIAF identifiers for about 20% of my given authority records. I then enhanced search-simple.pl to include limitations to specific authority sets. I then wrote search-suggest.pl (AutoSuggest interface), and not only was the result many times faster, but the result was just as good, if not better, than the previous result. This felt like two steps forward and one step back. Consequently, the reader may not ever need nor want to run search-simple.pl.

Third, while the AutoSuggest interface was much faster, I was not able to determine how suggestions were made. This makes the AutoSuggest interface seem a bit like a “black box”. One of my next steps, during the copious spare time I still have here in Rome, is to investigate how to make my scripts smarter. Specifically, I hope to exploit the use of the Levenshtein distance algorithm. [5]

Finally, I would not have been able to do this work without the “shoulders of giants”. Specifically, Stefano and I took long & hard looks at the code of people who have done similar things. For example, the source code of Jeff Chiu’s OpenRefine Reconciliation service demonstrates how to use the Levenshtein distance algorithm. [6] And we found Jakob Voß’s viaflookup.pl useful for pointing out AutoSuggest as well as elegant ways of submitting URLs to remote HTTP servers. [7] “Thanks, guys!”

Fun with MARC-based authority data!

Links

[0] VIAF – http://viaf.org

[1] VIAF Finder distribution – http://infomotions.com/sandbox/pusc/etc/viaf-finder.tar.gz

[2] VIAF API – http://www.oclc.org/developer/develop/web-services/viaf.en.html

[4] OpenRefine – http://openrefine.org

[5] Levenshtein distance – https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance

[6] Chiu’s reconciliation service – https://github.com/codeforkjeff/refine_viaf

[7] Voß’s viaflookup.pl – https://gist.github.com/nichtich/832052/3274497bfc4ae6612d0c49671ae636960aaa40d2

by Eric Lease Morgan at May 27, 2016 01:34 PM

May 09, 2016

Mini-musings

Making stone soup: Working together for the advancement of learning and teaching

It is simply not possible for any of us to do our jobs well without the collaboration of others. Yet specialization abounds, jargon proliferates, and professional silos are everywhere. At the same time we all have a shared goal: to advance learning and teaching. How are we to balance these two seemingly conflicting characteristics in our workplace? How can we satisfy the demands of our day-to-day jobs and at the same time contribute to the work of others? ‡

The answer is not technical but instead rooted in what it means to be a part of a holistic group of people. The answer is rooted in things like the abilities to listen, to share, to learn, to go beyond tolerance and towards respect, to take a sincere interest in the other person’s point of view, to discuss, and to take to heart the idea that nobody really sees the whole picture.

As people — members of the human race — we form communities with both our strengths & our weaknesses, with things we know would benefit the group & things we would rather not share, with both our beauties and our blemishes. This is part of what it means to be people. There is no denying it, and if we try, then we are only being less of who we really are. To deny it is an unrealistic expectation. We are not gods. We are not actors. We are people, and being people — real people — is a good thing.

Within any community, there are norms of behavior. Without norms of behavior, there is really no community, only chaos and anarchy. In anarchy and chaos, physical strength is oftentimes the defining characteristic of decision-making, but when the physically strong are outnumbered by the emotionally mature and intellectually aware, then chaos and anarchy are overthrown for a more holistic set of decision-making processes. Examples include democracy, consensus building, and even the possibility of governance through benevolent dictatorship.

A community’s norms are both written and unwritten. Our workplaces are good examples of such communities. On one hand there may be policies & procedures, but these policies & procedures usually describe workflows, the methods used to evaluate employees, or to some extent strategic plans. They might outline how meetings are conducted or how teams are to accomplish their goals. On the other hand, these policies & procedures do not necessarily outline how to talk with fellow employees around the virtual water cooler, how to write email messages, nor how to greet each other on a day-to-day basis. Just as importantly, our written norms of behavior do not describe how to treat and communicate with people outside one’s own set of personal expertise. Don’t get me wrong. This does not mean I am advocating written norms for such things, but such things do need to be discussed and agreed upon. Such are the beginnings of stone soup.

Increasingly we seem to work in disciplines of specialization, and these specializations, necessarily, generate their own jargon. “Where have all the generalists gone? Considering our current environment, is it really impossible to be a Renaissance Man^h^h^h Person?” Increasingly, the answer to the first question is, “The generalists have gone the way of Leonardo da Vinci.” And the answer to the second question is, “Apparently so.”

For example, some of us lean more towards formal learning, teaching, research, and scholarship. These are the people who have thoroughly studied and now teach a particular academic discipline. These same people have written dissertations, which, almost by definition, are very specialized, if not unique. They live in a world in pursuit of truth while balancing the worlds of rigorous scholarly publishing and student counseling.

There are those among us who thoroughly know the ins and outs of computer technology. These people can enumerate the differences between a word processor and a text editor. They can compare & contrast operating systems. These people can configure & upgrade software. They can make computers communicate on the Internet. They can troubleshoot computer problems when the computers seem — for no apparent reason — to just break.

Finally, there are those among us who specialize in the collection, organization, preservation, and dissemination of data, information, and knowledge. These people identify bodies of content, systematically describe them, make every effort to preserve them for posterity, and share them with their respective communities. These people deal with MARC records, authority lists, and subject headings.

Despite these truisms, we — our communities — need to figure out how to work together, how to bridge the gaps in our knowledge (a consequence of specialization), and how to achieve our shared goals. This is an aspect of our metaphoric stone soup.

So now the problem can be re-articulated. We live and work in communities of unwritten and poorly articulated norms. To complicate matters, because of our specializations, we all approach our situations from different perspectives and use different languages to deal with the situations. As I was discussing this presentation with a dear friend & colleague, the following poem attributed to Prissy Galagarian was brought to my attention†, and it eloquently states the imperative:

  The Person Next to You

  The person next to you is the greatest miracle
   and the greatest mystery you will ever
   meet at this moment.

  The person next to you is an inexhaustible
   reservoir of possibility,
   desire and dread,
   smiles and frowns, laughter and tears,
   fears and hopes,
   all struggling to find expression.

  The person next to you believes in something,
   stands for something, counts for something,
   lives for something, labors for something,
   waits for something, runs from something,
   runs to something.

  The person next to you has problems and fears,
   wonders how they're doing,
   is often undecided and unorganized
   and painfully close to chaos!
   Do they dare speak of it to you?

  The person next to you can live with you
   not just alongside you,
   not just next to you.

  The person next to you is a part of you.
   for you are the person next to them.

How do we overcome these impediments in order to achieve our mutual goals of the workplace? The root of the answer lies in our ability to really & truly respect our fellow employees.

Working together towards a shared goal is a whole lot like making “stone soup”. Do you know the story of “stone soup”? A man comes into a village, and asks the villagers for food. Every time he asks he is told that there is nothing to give. Despite an apparent lack of anything, the man sets up a little fire, puts a pot of water on, and drops a stone into the pot. Curious people come by, and they ask, “What are you doing?” He says, “I’m making stone soup, but I think it needs a bit of flavor.” Wanting to participate, people begin to add their own things to the soup. “I think I have some carrots,” says one villager. “I believe I have a bit of celery,” says another. Soon the pot is filled with bits of this and that and the other thing: onions, salt & pepper, a beef bone, a few tomatoes, a couple of potatoes, etc. In the end, a rich & hearty stew is made, enough for everybody to enjoy. Working together, without judgement or selfishness, the end result is a goal well-accomplished.

This can happen in the workplace as well. It can happen in our community where the goal is teaching & learning. And in the spirit of cooking, here’s a sort of recipe:

  1. Understand that you do not have all the answers, and in fact, nobody does; nobody has the complete story nor sees the whole picture. Only after working on a task, and completing it at least once, will a holistic perspective begin to develop.
  2. Understand that nobody’s experience is necessarily more important than the others’, including your own. Everybody has something to offer, and while your skills & expertise may be imperative to success, so are the skills & expertise of others. And if there is an established hierarchy within your workplace, understand that the hierarchy is all but arbitrary, and maintained by people with an over-developed sense of power. We all have more things in common than differences.
  3. Spend the time to get to know your colleagues, and come to a sincere appreciation of who they are as a person as well as a professional. This part of the “recipe” may include formal or informal social events inside or outside the workplace. Share a drink or a meal. Take a walk outside or through the local museum. Do this in groups of two or more. Such activities provide a way for everybody involved to reflect upon an outside stimulus. Through this process the interesting characteristics of the others will become apparent. Appreciate these characteristics. Do not judge them, but rather respect them.
  4. Remember, listening is a wonderful skill, and when the other person talks for a long time, they will go away thinking they had a wonderful conversation. Go beyond hearing what a person says. Internalize what they say. Ask meaningful & constructive questions, and speak their name frequently during discussions. These things will demonstrate your true intentions. Through this process the others will become a part of you, and you will become a part of them.
  5. Combine the above ingredients, bring them to a boil, and then immediately lower the temperature allowing everything to simmer for a good long time. Keeping the pot boiling will only overheat the soup and make a mess. Simmering will keep many of the individual parts intact, enable the flavors to mellow, and give you time to set the table for the next stage of the process.

Finally, making stone soup does not require fancy tools. A cast iron pot will work just as well as one made from aluminium or teflon. What is needed is a container large enough to hold the ingredients and withstand the heat. It doesn’t matter whether or not the heat source is gas, electric, or fire. It just has to be hot enough to allow boiling and then simmering. Similarly, stone soup in the workplace does not require Google Drive, Microsoft Office 365, nor any type of wiki. Sure, those things can facilitate project work, but they are not the means for getting to know your colleagues. Only through personal interaction will such knowledge be garnered.

Working together for the advancement of learning & teaching — or just about any other type of project work — is a lot like making stone soup. Everybody contributes a little something, and the result is a nourishing meal for all.

‡ This essay was written as a presentation for the AMICAL annual conference which took place in Rome (May 12-14, 2016), and this essay is available as a one-page handout.

† http://fraternalthoughts.blogspot.it/2011/02/person-next-to-you.html

by Eric Lease Morgan at May 09, 2016 12:26 PM

April 07, 2016

Mini-musings

Editing authorities at the speed of four records per minute

This missive outlines and documents an automated process I used to “clean up” and “improve” a set of authority records, or, to put it another way, how I edited authorities at the speed of four records per minute.

As you may or may not know, starting in September 2015 I commenced upon a sort of “leave of absence” from my employer.† This leave took me to Tuscany, Venice, Rome, Provence, Chicago, Philadelphia, Boston, New York City, and back to Rome. In Rome I worked for the American Academy in Rome doing short-term projects in the library. The first project revolved around authority records. More specifically, the library’s primary clientele were Americans, but the catalog’s authority records included a smattering of Italian headings. The goal of the project was to automatically convert as many of the “invalid” Italian headings as possible into “authoritative” Library of Congress headings.

Identify “invalid” headings

When I first got to Rome I had the good fortune to hang out with Terry Reese, the author of the venerable MarcEdit.‡ He was there giving workshops. I participated in the workshops. I listened, I learned, and I was grateful for a Macintosh-based version of Terry’s application.

When the workshops were over and Terry had gone home I began working more closely with Sebastian Hierl, the director of the Academy’s library.❧ Since the library was relatively small (about 150,000 volumes), and because the Academy used Koha for its integrated library system, it was relatively easy for Sebastian to give me the library’s entire set of 124,000 authority records in MARC format. I fed the authority records into MarcEdit, and ran a report against them. Specifically, I asked MarcEdit to identify the “invalid” records, which really means, “Find all the records not found in the Library of Congress database.” The result was a set of approximately 18,000 records or approximately 14% of the entire file. I then used MarcEdit to extract the “invalid” records from the complete set, and this became my working data.

Search & download

I next created a rudimentary table denoting the “invalid” records and the subsequent search results for them. This tab-delimited file included values of MARC field 001, MARC field 1xx, an integer denoting the number of times I searched for a matching record, an integer denoting the number of records I found, an identifier denoting a Library of Congress authority record of choice, and a URL providing access to the remote authority record. This table was initialized using a script called authority2list.pl. Given a file of MARC records, it outputs the table.

I then systematically searched the Library of Congress for authority headings. This was done with a script called search.pl. Given the table created in the previous step, this script looped through each authority, did a rudimentary search for a valid entry, and output an updated version of the table. This script was a bit “tricky”.❦ It first searched the Library of Congress by looking for the value of MARC 1xx$a. If no records were found, then no updating was done and processing continued. If one record was found, then the Library of Congress identifier was saved to the output and processing continued. If many records were found, then a more limiting search was done by adding a date value extracted from MARC 1xx$d. Depending on the second search result, the output was updated (or not), and processing continued. Out of the original 18,000 “invalid” records, about 50% of them were identified with no (zero) Library of Congress records, about 30% were associated with multiple headings, and the remaining 20% (approximately 3,600 records) were identified with one and only one Library of Congress authority record.
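
In outline, the logic of search.pl looks something like the following sketch. The search_loc() routine is a hypothetical stand-in for the actual query against the Library of Congress (see the note below regarding the ATOM interface of the LC Linked Data Service); here it merely returns canned results so the sketch will run:

  #!/usr/bin/perl

  # disambiguate.pl - a sketch of the two-pass search logic described above
  use strict;
  use warnings;

  # a hypothetical stand-in for the real search; in real life, submit the
  # query to the Library of Congress and return matching identifiers
  sub search_loc {
      my $query = shift;
      return ( 'n0000000001' ) if $query =~ /\d{4}/;   # pretend the dated query is unique
      return ( 'n0000000001', 'n0000000002' );         # pretend the bare name is ambiguous
  }

  sub disambiguate {
      my ( $name, $date ) = @_;
      my @hits = search_loc( $name );
      return undef      if @hits == 0;       # zero hits; give up
      return $hits[ 0 ] if @hits == 1;       # a single hit; use it
      @hits = search_loc( "$name $date" );   # many hits; add the date and try again
      return @hits == 1 ? $hits[ 0 ] : undef;
  }

  print disambiguate( 'Twain, Mark', '1835-1910' ) || 'none', "\n";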

I now had a list of 3,600 “valid” authority records, and I needed to download them. This was done with a script called harvest.pl. This script is really a wrapper around a program called GNU Wget. Given my updated table, the script looped through each row, and if it contained a URL pointing to a Library of Congress authority record, then the record was cached to the file system. Since the downloaded records were formatted as MARCXML, I then needed to transform them into MARC21. This was done with a pair of scripts: xml2marc.sh and xml2marc.pl. The former simply looped through each file in a directory, and the latter did the actual transformation, along the way updating MARC 001 to the value of the local authority record.
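
The heart of that transformation might look something like this sketch, assuming the MARC::File::XML module; the file names and command-line arguments are illustrative only:

  #!/usr/bin/perl

  # xml2marc-sketch.pl - transform a downloaded MARCXML authority record
  # into MARC21 while re-setting the 001 to the local identifier
  use strict;
  use warnings;
  use MARC::Record;
  use MARC::File::XML ( BinaryEncoding => 'utf8' );

  my ( $xmlfile, $localid ) = @ARGV;   # e.g. ./xml2marc-sketch.pl record.xml 000123456

  # slurp the MARCXML and build a MARC::Record object
  open my $fh, '<:encoding(UTF-8)', $xmlfile or die "Can't open $xmlfile: $!";
  my $xml = do { local $/; <$fh> };
  close $fh;
  my $record = MARC::Record->new_from_xml( $xml, 'UTF-8' );

  # overwrite the 001 with the local identifier and output MARC21
  $record->field( '001' )->update( $localid );
  binmode STDOUT, ':utf8';
  print $record->as_usmarc;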

Verify and merge

In order to allow myself as well as others to verify that correct records had been identified, I wrote another pair of programs: marc2compare.pl and compare2html.pl. Given two MARC files, marc2compare.pl created a list of identifiers, original authority values, proposed authority values, and URLs pointing to full descriptions of each. This list was intended to be poured into a spreadsheet for compare & contrast purposes. The second script, compare2html.pl, simply took the output of the first and transformed it into a simple HTML page making it easier for a librarian to evaluate correctness.

Assuming the 3,600 records were correct, the next step was to merge/overlay the old records with the new records. This was a two-step process. The first step was accomplished with a script called rename.pl. Given two MARC files, rename.pl first looped through the set of new authorities saving each identifier to memory. It then looped through the original set of authorities looking for records to update. When records to update were found, each was marked for deletion by prefixing MARC 001 with “x-”. The second step employed MarcEdit to actually merge the set of new authorities with the original authorities. Consequently, the authority file increased in size by 3,600 records. It was then up to other people to load the authorities into Koha, re-evaluate the authorities for correctness, and if everything was okay, then delete each authority record prefixed with “x-”.
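
A sketch of the marking step might look like this, again assuming MARC::Batch; it writes the updated originals to STDOUT rather than editing anything in place:

  #!/usr/bin/perl

  # rename-sketch.pl - mark superseded authority records for deletion by
  # prefixing their 001 values with "x-"
  use strict;
  use warnings;
  use MARC::Batch;

  my ( $newfile, $originalfile ) = @ARGV;

  # first pass: remember the identifiers of the newly harvested records
  my %replacements = ();
  my $new = MARC::Batch->new( 'USMARC', $newfile );
  while ( my $record = $new->next ) {
      my $field001 = $record->field( '001' ) or next;
      $replacements{ $field001->data }++;
  }

  # second pass: mark the corresponding original records for deletion
  binmode STDOUT, ':utf8';
  my $original = MARC::Batch->new( 'USMARC', $originalfile );
  while ( my $record = $original->next ) {
      my $field001 = $record->field( '001' ) or next;
      my $id       = $field001->data;
      $field001->update( "x-$id" ) if $replacements{ $id };
      print $record->as_usmarc;
  }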

Done.❀

Summary and possible next steps

In summary, this is how things happened. I:

  1. got a complete dump of the original 123,329 authority records
  2. extracted 17,593 “invalid” records
  3. searched LOC for “valid” records and found 3,627 of them
  4. harvested the found records
  5. prefixed the 3,627 001 fields in the original file with “x-“
  6. merged the original authority records with the harvested records
  7. made the new set of 126,956 updated records available

There were many possible next steps. One possibility was to repeat the entire process but with an enhanced search algorithm. This could be difficult considering the fact that searches using merely the value of 1xx$a returned zero hits for half of the working data. A second possibility was to identify authoritative records from a different system such as VIAF or WorldCat. Even if this was successful, I wonder how possible it would have been to actually download authority records as MARC. A third possibility was to write a sort of disambiguation program allowing librarians to choose from a set of records. This could have been accomplished by searching for authorities, presenting possibilities, allowing librarians to make selections via an HTML form, caching the selections, and finally, batch updating the master authority list. Here at the Academy we denoted the last possibility as the “cool” one.

Now here’s an interesting way to look at the whole thing. This process took me about two weeks’ worth of work, and in those two weeks I processed 18,000 authority records. That comes out to 9,000 records/week. There are 40 hours in a work week, and consequently, I processed 225 records/hour. Each hour is made up of 60 minutes, and therefore I processed approximately 4 records/minute, and that is 1 record every fifteen seconds for the last two weeks. Wow!?

Finally, I’d like to thank the Academy (with all puns intended). Sebastian, his colleagues, and especially my office mate (Kristine Iara) were all very supportive throughout my visit. They provided intellectual stimulation and something to do while I contemplated my navel during the “adventure”.

Notes

† Strictly speaking, my adventure was not a sabbatical nor a leave of absence because: 1) as a librarian I was not authorized to take a sabbatical, and 2) I did not have any healthcare issues. Instead, after bits of negotiation, my contract was temporarily changed from full-time faculty to adjunct faculty, and I worked for my employer 20% of the time. The other 80% of the time was spent on my “adventure”. And please don’t get me wrong, this whole thing was a wonderful opportunity for which I will be eternally grateful. “Thank you!”

‡ During our overlapping times there in Rome, Terry & I played tourist which included the Colosseum, a happenstance mass at the Pantheon, a Palm Sunday Mass in St. Peter’s Square with tickets generously given to us by Joy Nelson of ByWater Solutions, and a day-trip to Florence. Along the way we discussed librarianship, open source software, academia, and life in general. A good time was had by all.

❧ Ironically, Sebastian & I were colleagues during the dot-com boom when we both worked at North Carolina State University. The world of librarianship is small.

❦ This script — search.pl — was really a wrapper around an application called curl, and thanks go to Jeff Young of OCLC who pointed me to the ATOM interface of the LC Linked Data Service. Without Jeff’s helpful advice, I would have wrestled with OCLC’s various authentication systems and Web Service interfaces.

❀ Actually, I skipped a step in this narrative. Specifically, there are some records in the authority file that were not expected to be touched, even if they are “invalid”. This set of records was associated with a specific call number pattern. Two scripts (fu-extract.pl and fu-remove.pl) did the work. The first extracted a list of identifiers not to touch and the second removed them from my table of candidates to validate.

by Eric Lease Morgan at April 07, 2016 07:52 AM

Date created: 2000-05-19
Date updated: 2011-05-03
URL: http://infomotions.com/