August 28, 2014

Readings

Hundredth Psalm to the Tune of "Green Sleeves": Digital Approaches to Shakespeare's Language of Genre

Provides a set of sound arguments for the use of computers to analyze texts, and uses DocuScope as an example.
  • Creator(s): Hope, Jonathan; Witmore, Michael
  • Date created: 2010-09-21
  • Date read: 2014-08-28
  • Facet/terms: Formats/Journal articles; Themes/Text mining;
  • Rights: Restricted
  • Source: Jonathan Hope and Michael Witmore. "The Hundredth Psalm to the Tune of 'Green Sleeves': Digital Approaches to Shakespeare's Language of Genre." Shakespeare Quarterly 61.3 (2010): 357-390. Project MUSE. Web. 28 Aug. 2014. <http://muse.jhu.edu/>
  • Versions(s): original; local/annotated

August 28, 2014 04:00 AM

August 16, 2014

Mini-musings

Publishing LOD with a bent toward archivists



This essay provides an overview of linked open data (LOD) with a bent towards archivists. It enumerates a few advantages the archival community has when it comes to linked data, as well as some distinct disadvantages. It demonstrates one way to expose EAD as linked data through the use of XSLT transformations and then through a rudimentary triple store/SPARQL endpoint combination. Enhancements to the linked data publication process are then discussed. The text of this essay in the form of a handout, as well as a number of support files, can also be found at http://infomotions.com/sandbox/lodlamday/.

Review of RDF

The ultimate goal of LOD is to facilitate the discovery of new information and knowledge. To accomplish this goal, people are expected to make metadata describing their content available on the Web in one or more forms of RDF — Resource Description Framework. RDF is not so much a file format as a data structure. It is a collection of “assertions” in the form of “triples” akin to rudimentary “sentences” where the first part of the sentence is a “subject”, the second part is a “predicate”, and the third part is an “object”. Both the subjects and predicates are required to be Uniform Resource Identifiers — URIs. (Think “URLs”.) The subject URI is intended to denote a person, place, or thing. The predicate URI is used to specify relationships between subjects and the objects. When verbalizing RDF assertions, it is usually helpful to prefix predicate URIs with an “is a” or “has a” phrase. For example, “This book ‘has a’ title of ‘Huckleberry Finn'” or “This university ‘has a’ home page of URL”. The objects of RDF assertions are ideally more URIs but they can also be “strings” or “literals” — words, phrases, numbers, dates, geo-spatial coordinates, etc. Finally, it is expected that the URIs of RDF assertions are shared across domains and RDF collections. By doing so, new assertions can be literally “linked” across the world of RDF in the hopes of establishing new relationships, and in the process new information and new knowledge are brought to light.
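To make the idea concrete, the following is a tiny sketch, in plain Perl, that writes a few such assertions using the N-Triples serialization. The subject and object URIs (and the literal) are invented for the purpose of illustration; only the two Dublin Core predicate URIs are real:

#!/usr/bin/perl

# triples.pl - print a few illustrative RDF assertions as N-Triples
# the book and person URIs below are made up; the predicates are Dublin Core

use strict;
use warnings;

my $book    = '<http://example.org/books/huckleberry-finn>';
my $title   = '<http://purl.org/dc/elements/1.1/title>';
my $creator = '<http://purl.org/dc/elements/1.1/creator>';

# each assertion is a subject, a predicate, and an object followed by a period;
# the first object is a literal (a string), the second is, ideally, another URI
print "$book $title \"Huckleberry Finn\" .\n";
print "$book $creator <http://example.org/people/mark-twain> .\n";

exit;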

Simple foray into publishing linked open data

Manifesting RDF from archival materials by hand is not an easy process because nobody is going to manually type the hundreds of triples necessary to adequately describe any given item. Fortunately, it is common for the description of archival materials to be manifested in the form of EAD files. Being a form of XML, valid EAD files must be well-formed and conform to a specific DTD or schema. This makes it easy to use XSLT to transform EAD files into various (“serialized”) forms of RDF such as RDF/XML, Turtle, or JSON-LD. A few years ago such a stylesheet was written by Pete Johnston for the Archives Hub as a part of the Hub’s LOCAH project. The stylesheet outputs RDF/XML and it was written specifically for Archives Hub EAD files. It has been slightly modified here and incorporated into a Perl script. The Perl script reads the EAD files in a given directory and transforms them into both RDF/XML and HTML. The RDF/XML is intended to be read by computers. The HTML is intended to be read by people. By simply using something like the Perl script, an archive can easily participate in LOD. The results of these efforts can be seen in the local RDF and HTML directories. Nobody is saying the result is perfect or complete, but it is more than a head start, and all of this is possible because the content of archives is oftentimes described using EAD.
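Something like the following illustrates the heart of such a transformation. It is a minimal sketch using XML::LibXML and XML::LibXSLT, and the file names (the stylesheet and the finding aid) are placeholders rather than the actual names used by the script described above:

#!/usr/bin/perl

# ead2rdf.pl - transform a single EAD file into RDF/XML with an XSLT stylesheet
# usage: ./ead2rdf.pl ead2rdf.xsl finding-aid.xml > finding-aid.rdf
# both file names are placeholders for this sketch

use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

my ( $stylesheet_file, $ead_file ) = @ARGV;
die "usage: $0 <stylesheet.xsl> <ead.xml>\n" unless ( $stylesheet_file and $ead_file );

# parse the stylesheet and the finding aid
my $xslt       = XML::LibXSLT->new;
my $stylesheet = $xslt->parse_stylesheet_file( $stylesheet_file );
my $ead        = XML::LibXML->load_xml( location => $ead_file );

# do the work and send the resulting RDF/XML to standard output
my $rdf = $stylesheet->transform( $ead );
print $stylesheet->output_string( $rdf );

exit;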

Triple stores and SPARQL endpoints

By definition, linked data (RDF) is structured data, and structured data lends itself very well to relational database applications. In the realm of linked data, these database applications are called “triple stores”. Database applications excel at the organization of data, but they are also designed to facilitate search. In the realm of relational databases, the standard query language is called SQL, and there is a similar query language for triple stores. It is called SPARQL. The term “SPARQL endpoints” is used to denote a URL where SPARQL queries can be applied to a specific triple store.

4store is an open source triple store application which also supports SPARQL endpoints. Once compiled and installed, it is controlled and managed through a set of command-line applications. These applications support the sorts of things one expects with any other database application such as create database, import into database, search database, dump database, and destroy database. Two other commands turn on and turn off SPARQL endpoints.

For the purposes of LODLAM Training Day, a 4store triple store was created, filled with sample data, and made available as a SPARQL endpoint. If it has been turned on, then the following links ought to return useful information and demonstrate additional ways of publishing linked data:
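For example, a rudimentary SPARQL query can be sent to an endpoint with nothing more than LWP. In the sketch below the endpoint URL is a placeholder, and the output parameter is an assumption; some endpoints negotiate the result format through an Accept header instead:

#!/usr/bin/perl

# sparql.pl - send a simple SPARQL query to an endpoint and print the raw response
# the endpoint URL is a placeholder; substitute the address of a real SPARQL endpoint

use strict;
use warnings;
use LWP::UserAgent;

my $endpoint = 'http://localhost:8000/sparql/';
my $query    = 'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10';

# submit the query as a form post and print whatever comes back
my $ua       = LWP::UserAgent->new;
my $response = $ua->post( $endpoint, { query => $query, output => 'json' } );
die $response->status_line, "\n" unless $response->is_success;
print $response->decoded_content, "\n";

exit;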

Advantages and disadvantages

The previous sections demonstrate the ease with which archival metadata can be published as linked data. These demonstrations are not the be-all nor end-all of the linked data publication process. Additional techniques could be employed. Exploiting content negotiation in response to a given URI is an excellent example. Supporting alternative RDF serializations is another example. It behooves the archivist to provide enhanced views of the linked data, which are sometimes called “graphs”. The linked data can be combined with the linked data of other publishers to implement even more interesting services, views, and graphs. All of these things are advanced techniques requiring the skills of additional people (graphic designers, usability experts, computer programmers, systems administrators, allocators of time and money, project managers, etc.). Despite this, given the tools outlined above, it is not too difficult to publish linked data today. Such are the advantages.

On the other hand, there are at least two distinct disadvantages. The most significant derives from the inherent nature of archival material. Archival material is almost always rare or unique. Because it is rare and unique, there are few (if any) previously established URIs for the people and things described in archival collections. This is unlike the world of librarianship, where the materials of libraries are often owned by multiple institutions. Union catalogs share authority lists denoting people and institutions. Shared URIs across domains are imperative for the idea of the Semantic Web to come to fruition. The archival community has no such collection of shared URIs. Maybe the community-wide implementation and exploitation of Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF) can help resolve this problem. After all, it too is a form of XML which lends itself very well to XSLT transformation.

Second, and almost as importantly, the use of EAD is not really the best way to manifest archival metadata for linked data publication. EADs are finding aids. They are essentially narrative essays describing collections as a whole. They tell stories. The controlled vocabularies articulated in the header do not necessarily apply to each of the items in the container list. For good reasons, the items in the container list are minimally described. Consequently, the resulting RDF statements come across as rather thin and poorly linked to fuller descriptions. Moreover, different archivists put different emphases on different aspects of EAD description. This makes amalgamated collections of archival linked data difficult to navigate; the linked data requires cleaning and normalization. The solution to these problems might be to create and maintain archival collections in database applications, such as ArchivesSpace, and have linked data published from there. By doing so the linked data publication efforts of the archival community would be more standardized and somewhat centralized.

Summary

This essay has outlined the ease with which archival metadata in the form of EAD can be published as linked data. The result is far from perfect, but a huge step in the right direction. Publishing linked data is not an event, but rather an iterative process. There is always room for improvement. Starting today, publish your metadata as linked data.

by Eric Lease Morgan at August 16, 2014 02:56 PM

August 07, 2014

Readings

Theme from Macroanalysis: Digital Methods and Literary History (Topics in the Digital Humanities)

This chapter describes the how's and why's of topic modeling.

August 07, 2014 04:00 AM

July 19, 2014

Mini-musings

Fun with Koha

These are brief notes about my recent experiences with Koha.

Introduction

As you may or may not know, Koha is a granddaddy of library-related open source software, and it is an integrated library system to boot. Such are no small accomplishments. For reasons I will not elaborate upon, I’ve been playing with Koha for the past number of weeks, and in short, I want to say, “I’m impressed.” The community is large, international, congenial, and supportive. The community is divided into a number of sub-groups: developers, committers, commercial support employees, and, of course, librarians. I’ve even seen people from another open source library system (Evergreen) provide technical support and advice. For the most part, everything is on the ‘Net, well laid out, and transparent. There are some rather “organic” parts to the documentation akin to an “English garden”, but that is going to happen in any de-centralized environment. All in all, and without any patronizing intended, “Kudos to Koha!”

Installation

Looking through my collection of tarballs, I see I’ve installed Koha a number of times over the years, but this time it was challenging. Sparing you all the details, I needed to use a specific version of MySQL (version 5.5), and I had version 5.6. The installation failure was not really Koha’s fault. It is more the fault of MySQL because the client of MySQL version 5.6 outputs a warning message to STDOUT when a password is passed on the command line. This message confused the Koha database initialization process, thus making Koha unusable. After downgrading to version 5.5 the database initialization process was seamless.

My next step was to correctly configure Zebra — Koha’s default underlying indexer. Again, I had installed from source, and my Zebra libraries, etc. were saved in a directory different from the configuration files created by Koha’s installation process. After correctly updating the value of modulePath to point to /usr/local/lib/idzebra-2.0/ in zebra-biblios-dom.cfg, zebra-authorities.cfg, zebra-biblios.cfg, and zebra-authorities-dom.cfg, I could successfully index and search for content. I learned this from a mailing list posting.

Koha “extras”

Koha comes (for free) with a number of “extras”. For example, the Zebra indexer can be deployed as both a Z39.50 server and an SRU server. Turning these things on was as simple as uncommenting a few lines in the koha-conf.xml file and opening a few ports in my firewall. Z39.50 is inherently unusable from a human point of view so I didn’t go into configuring it, but it does work. Through the use of XSL stylesheets, SRU can be much more usable. Luckily I have been here before. For example, a long time ago I used Zebra to index my Alex Catalogue as well as some content from the HathiTrust (MBooks). The hidden interface to the Catalogue sports faceted searching and used to support spelling corrections. The MBooks interface transforms MARCXML into simple HTML. Both of these interfaces are quite zippy. In order to get Zebra to recognize my XSL I needed to add an additional configuration directive to my koha-conf.xml file. Specifically, I needed to add a docpath element to my public server’s configuration. Once I re-learned this fact, implementing a rudimentary SRU interface to my Koha index was easy and results are returned very fast. I’m impressed.
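For the record, an SRU searchRetrieve request is little more than a URL with a handful of well-known parameters. The sketch below demonstrates the idea; the host, port, and database name are placeholders and will differ from installation to installation:

#!/usr/bin/perl

# sru.pl - submit a simple SRU searchRetrieve request and print the raw XML response
# the base URL (host, port, and database name) is a placeholder for this sketch

use strict;
use warnings;
use LWP::UserAgent;
use URI;

# version, operation, query, and maximumRecords are standard SRU parameters
my $sru = URI->new( 'http://localhost:9998/biblios' );
$sru->query_form(
  version        => '1.1',
  operation      => 'searchRetrieve',
  query          => 'origami',
  maximumRecords => 5
);

# do the work
my $ua       = LWP::UserAgent->new;
my $response = $ua->get( $sru );
die $response->status_line, "\n" unless $response->is_success;
print $response->decoded_content, "\n";

exit;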

My big goal is to figure out ways Koha can expose its content to the wider ‘Net. To this end Koha comes with an OAI-PMH interface. It needs to be enabled, and can be done through the Koha Web-based backend under Home -> Koha Administration -> Global Preferences -> General Systems Preferences -> Web Services. Once enabled, OAI sets can be created through the Home -> Administration -> OAI sets configuration module. (Whew!) Once this is done Koha will respond to OAI-PMH requests. I then took it upon myself to transform the OAI output into linked data using a program called OAI2LOD. This worked seamlessly, and for a limited period of time you can browse my Koha’s cataloging data as linked data. The viability of the resulting linked data is questionable, but that is another blog posting.
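Verifying the OAI-PMH interface is a matter of sending a few verbs to Koha's OAI gateway, something like the sketch below. The base URL is a placeholder pointing at a hypothetical local installation:

#!/usr/bin/perl

# oai.pl - send a couple of OAI-PMH requests to a repository and print the responses
# the base URL is a placeholder; point it at a real OAI-PMH gateway

use strict;
use warnings;
use LWP::UserAgent;

my $base = 'http://localhost/cgi-bin/koha/oai.pl';
my $ua   = LWP::UserAgent->new;

# Identify describes the repository as a whole; ListRecords returns the records themselves
foreach my $request ( "$base?verb=Identify", "$base?verb=ListRecords&metadataPrefix=oai_dc" ) {
  my $response = $ua->get( $request );
  die $response->status_line, "\n" unless $response->is_success;
  print $response->decoded_content, "\n\n";
}

exit;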

Ideas and next steps

Library catalogs (OPACs, “discovery systems”, whatever you want to call them) are not simple applications/systems. They are a mixture of very specialized inventory lists, various types of people with various skills and authorities, indexing, and circulation, etc. Then we — as librarians — add things like messages of the day, record exporting, browsable lists, visualizations, etc. that complicate the whole thing. It is simply not possible to create a library catalog in the “Unix way”. The installation of Koha was not easy for me. There are expenses with open source software, and I all but melted down my server during the installation process. (Everything is now back to normal.) I’ve been advocating open source software for quite a while, and I understand the meaning of “free” in this context. I’m not complaining. Really.

Now that I’ve gotten this far, my next step is to investigate the feasibility of using a different indexer with Koha. Zebra is functional. It is fast. It is multi-faceted (all puns intended). But configuring it is not straight-forward, and its community of support is tiny. I see from rooting around in the Koha source code that Solr has been explored. I have also heard through the grapevine that ElasticSearch has been explored. I will endeavor to explore these things myself and report on what I learn. Different indexers, with more flexible APIs, may make the possibility of exposing Koha content as linked data more feasible as well.

Wish me luck.

by Eric Lease Morgan at July 19, 2014 06:16 PM

July 16, 2014

Readings

Matisse: "Jazz"

"Arguably one of the most beloved works of twentieth-century art, Henri Matisse's "Jazz" portfolio - with its inventiveness, spontaneity, and pure intensely pigmented color - projects a sense of joy and freedom." These are the gallery notes from an exhibit of Jazz at the Des Moines (Iowa) art museum.

July 16, 2014 04:00 AM

Jazz, (Henri Matisse)

"Jazz (1947) is an artist's book of 250 prints for the folded book version and 100 impressions for the suite, which contains the unfolded pochoirs without the text, based on paper cutouts by Henri Matisse. Teriade, a noted 20th century art publisher, arranged to have Matisse's cutouts rendered as pochoir (stencil) prints."

July 16, 2014 04:00 AM

Context for the creation of Jazz

"In 1943, while convalescing from a serious operation, Henri Matisse began work on a set of collages to illustrate an, as yet, untitled and undecided text. This suite of twenty images, translated into "prints" by the stenciling of gouache paint, became known as Jazz---considered one of his most ambitious and important series of work." These are notes about the work Jazz by Matisse.

July 16, 2014 04:00 AM

July 08, 2014

Life of a Librarian

Lexicons and sentiment analysis – Notes to self

This is mostly a set of notes to myself on lexicons and sentiment analysis.

A couple of weeks ago I asked Jeffrey Bain-Conkin to read at least one article about sentiment analysis (sometimes called “opinion mining”), and specifically I asked him to help me learn about the use of lexicons in such a process. He came back with a few more articles and a list of pointers to additional information. Thank you, Jeffrey! I am echoing the list here for future reference, for the possible benefit of others, and to remove some of the clutter from my to-do list. While I haven’t read and examined each of the items in great detail, just re-creating the list increases my knowledge. The list is divided into three sections: lexicons, software, and “more”.

Lexicons

  • Arguing Lexicon – “The lexicon includes patterns that represent arguing.”
  • BOOTStrep Bio-Lexicon – “Biological terminology is a frequent cause of analysis errors when processing literature written in the biology domain. For example, ‘retro-regulate’ is a terminological verb often used in molecular biology but it is not included in conventional dictionaries. The BioLexicon is a linguistic resource tailored for the biology domain to cope with these problems. It contains the following types of entries: a set of terminological verbs, a set of derived forms of the terminological verbs, general English words frequently used in the biology domain, [and] domain terms.”
  • English Phrases for Information Retrieval – “Goal of the ‘English Phrases for IR’ (EP4IR) project at the Radboud University Nijmegen (The Netherlands) is the development of a grammar and lexicon of English suitable for applications in Information Retrieval and available in the public domain.”
  • General Inquirer – “The General Inquirer is basically a mapping tool. It maps each text file with counts on dictionary-supplied categories. The currently distributed version combines the ‘Harvard IV-4′ dictionary content-analysis categories, the ‘Lasswell’ dictionary content-analysis categories, and five categories based on the social cognition work of Semin and Fiedler, making for 182 categories in all. Each category is a list of words and word senses. A category such as ‘self references’ may contain only a dozen entries, mostly pronouns. Currently, the category ‘negative’ is our largest with 2291 entries. Users can also add additional categories of any size.”
  • NRC word-emotion association lexicon – “The lexicon has human annotations of emotion associations for more than 24,200 word senses (about 14,200 word types). The annotations include whether the target is positive or negative, and whether the target has associations with eight basic emotions (joy, sadness, anger, fear, surprise, anticipation, trust, disgust).” The URL also points to a large number of articles on sentiment analysis in general.
  • Subjectivity Lexicon – “The Subjectivity Lexicon (list of subjectivity clues) that is part of OpinionFinder…”
  • WordNet – “WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.”
  • WordNet Domains – “WordNet Domains is a lexical resource created in a semi-automatic way by augmenting WordNet with domain labels. WordNet Synsets have been annotated with at least one semantic domain label, selected from a set of about two hundred labels structured according the WordNet Domain Hierarchy. Information brought by domains is complementary to what is already in Wordnet. A domain may include synsets of different syntactic categories and from different WordNet sub-hierarchies. Domains may group senses of the same word into homogeneous clusters, with the side effect of reducing word polysemy in WordNet.”
  • WordNet-Affect – “WordNet-Affect is an extension of WordNet Domains, including a subset of synsets suitable to represent affective concepts correlated with affective words. Similarly to our method for domain labels, we assigned to a number of WordNet synsets one or more affective labels (a-labels). In particular, the affective concepts representing emotional state are individuated by synsets marked with the a-label emotion. There are also other a-labels for those concepts representing moods, situations eliciting emotions, or emotional responses. The resource was extended with a set of additional a-labels (called emotional categories), hierarchically organized, in order to specialize synsets with a-label emotion. The hierarchical structure of new a-labels was modeled on the WordNet hyperonym relation. In a second stage, we introduced some modifications, in order to distinguish synsets according to emotional valence. We defined four addictional a-labels: positive, negative, ambiguous, and neutral.”

Software / applications

  • Linguistic Inquiry and Word Count – “Linguistic Inquiry and Word Count (LIWC) is a text analysis software program designed by James W. Pennebaker, Roger J. Booth, and Martha E. Francis. LIWC calculates the degree to which people use different categories of words across a wide array of texts, including emails, speeches, poems, or transcribed daily speech. With a click of a button, you can determine the degree any text uses positive or negative emotions, self-references, causal words, and 70 other language dimensions.”
  • OpinionFinder – “OpinionFinder is a system that processes documents and automatically identifies subjective sentences as well as various aspects of subjectivity within sentences, including agents who are sources of opinion, direct subjective expressions and speech events, and sentiment expressions.”
  • SenticNet – “SenticNet is a publicly available semantic resource for concept-level sentiment analysis. The affective common-sense knowledge base is built by means of sentic computing, a paradigm that exploits both AI and Semantic Web techniques to better recognize, interpret, and process natural language opinions over the Web. In particular, SenticNet exploits an ensemble of graph-mining and dimensionality-reduction techniques to bridge the conceptual and affective gap between word-level natural language data and the concept-level opinions and sentiments conveyed by them. SenticNet is a knowledge base that can be employed for the development of applications in fields such as big social data analysis, human-computer interaction, and e-health.”
  • SPECIALIST NLP Tools – “The SPECIALIST Natural Language Processing (NLP) Tools have been developed by the The Lexical Systems Group of The Lister Hill National Center for Biomedical Communications to investigate the contributions that natural language processing techniques can make to the task of mediating between the language of users and the language of online biomedical information resources. The SPECIALIST NLP Tools facilitate natural language processing by helping application developers with lexical variation and text analysis tasks in the biomedical domain. The NLP Tools are open source resources distributed subject to these [specific] terms and conditions.”
  • Visual Sentiment Ontology – “The analysis of emotion, affect and sentiment from visual content has become an exciting area in the multimedia community allowing to build new applications for brand monitoring, advertising, and opinion mining. There exists no corpora for sentiment analysis on visual content, and therefore limits the progress in this critical area. To stimulate innovative research on this challenging issue, we constructed a new benchmark and database. This database contains a Visual Sentiment Ontology (VSO) consisting of 3244 adjective noun pairs (ANP), SentiBank a set of 1200 trained visual concept detectors providing a mid-level representation of sentiment, associated training images acquired from Flickr, and a benchmark containing 603 photo tweets covering a diverse set of 21 topics. This website provides the above mentioned material for download…”

Lists of additional information

  • Lexical databases and corpora – “This is a list of links to lexical databases and corpora, organized by language or language group. The resources on this page were initially compiled from announcements on the LINGUIST list and web-search results. This is not intended to be an exhaustive list, but rather a place to organize and store potentially useful links as I [Jen Smith] encounter them.”
  • Opinion Mining, Sentiment Analysis, and Opinion Spam Detection – a long list of links pointing to articles, etc. about opinion mining.
  • Sentiment Symposium Tutorial – “This tutorial covers all aspects of building effective sentiment analysis systems for textual data, with and without sentiment-relevant metadata like star ratings. We proceed from pre-processing techniques to advanced uses cases, assessing common approaches and identifying best practices.”

Summary

What did I learn? I learned that to do sentiment analysis, lexicons are often employed. I learned that to evaluate a corpus for a particular sentiment, a researcher first needs to create a lexicon embodying that sentiment. Each element in the lexicon then needs to be assigned a quantitative value. The lexicon is then compared to the corpus, tabulating the occurrences of lexicon terms. Once tabulated, scores can be summed, measurements taken, observations made and graphed, and conclusions/judgments drawn. Correct? Again, thank you, Jeffrey!
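To make the process concrete, below is a toy sketch of the lexicon-based approach described above. The handful of lexicon words and their weights are invented purely for illustration:

#!/usr/bin/perl

# sentiment.pl - score a text against a tiny, made-up sentiment lexicon
# usage: ./sentiment.pl corpus.txt

use strict;
use warnings;

# a toy lexicon; each element carries a quantitative value (positive or negative)
my %lexicon = ( good => 1, wonderful => 2, bad => -1, terrible => -2 );

# compare the lexicon to the corpus, tabulating occurrences and summing the scores
my $score = 0;
my %tally = ();
while ( my $line = <> ) {
  foreach my $word ( split /\W+/, lc $line ) {
    next unless exists $lexicon{ $word };
    $tally{ $word }++;
    $score += $lexicon{ $word };
  }
}

# report; observations, graphs, and judgments are left as an exercise
foreach my $word ( sort keys %tally ) { print "$word\t$tally{ $word }\n" }
print "total score: $score\n";

exit;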

“Librarians love lists.”

by Eric Lease Morgan at July 08, 2014 07:12 PM

July 03, 2014

Life of a Librarian

What’s Eric Reading?

I have resurrected an application/system of files used to archive and disseminate things (mostly articles) I’ve been reading. I call it What’s Eric Reading? From the original About page:

I have been having fun recently indexing PDF files.

For the past six months or so I have been keeping the articles I’ve read in a pile, and I was rather amazed at the size of the pile. It was about a foot tall. When I read these articles I “actively” read them — meaning, I write, scribble, highlight, and annotate the text with my own special notation denoting names, keywords, definitions, citations, quotations, list items, examples, etc. This active reading process: 1) makes for better comprehension on my part, and 2) makes the articles easier to review and pick out the ideas I thought were salient. Being the librarian I am, I thought it might be cool (“kewl”) to make the articles into a collection. Thus, the beginnings of Highlights & Annotations: A Value-Added Reading List.

The techno-weenie process for creating and maintaining the content is something this community might find interesting:

  1. Print article and read it actively.
  2. Convert the printed article into a PDF file — complete with embedded OCR — with my handy-dandy ScanSnap scanner.
  3. Use MyLibrary to create metadata (author, title, date published, date read, note, keywords, facet/term combinations, local and remote URLs, etc.) describing the article.
  4. Save the PDF to my file system.
  5. Use pdftotext to extract the OCRed text from the PDF and index it along with the MyLibrary metadata using Solr. (A sketch of this step appears after this list.)
  6. Provide a searchable/browsable user interface to the collection through a mod_perl module.
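Here is a rough sketch of step #5. The Solr URL and the field names are placeholders, and the real system indexes the MyLibrary metadata as well:

#!/usr/bin/perl

# index-pdf.pl - extract the OCRed text from a PDF file and add it to a Solr index
# usage: ./index-pdf.pl article.pdf
# the Solr URL and the field names are placeholders for this sketch

use strict;
use warnings;
use HTTP::Request;
use JSON;
use LWP::UserAgent;

my $solr = 'http://localhost:8983/solr/readings/update?commit=true';
my $pdf  = $ARGV[ 0 ] or die "usage: $0 <file.pdf>\n";

# use pdftotext to pull the plain text out of the PDF
my $text = `pdftotext "$pdf" -`;
die "pdftotext failed\n" if ( $? != 0 );

# build a rudimentary Solr document and post it as JSON
my $document = [ { id => $pdf, fulltext => $text } ];
my $request  = HTTP::Request->new( 'POST', $solr );
$request->header( 'Content-Type' => 'application/json' );
$request->content( encode_json( $document ) );

my $response = LWP::UserAgent->new->request( $request );
die $response->status_line, "\n" unless $response->is_success;
print "indexed $pdf\n";

exit;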

Software is never done, and if it were then it would be called hardware. Accordingly, I know there are some things I need to do before I can truly deem the system version 1.0. At the same time my excitement is overflowing and I thought I’d share some geekdom with my fellow hackers.

Fun with PDF files and open source software.

by Eric Lease Morgan at July 03, 2014 08:36 PM

Readings

Librarians And Scholars: Partners In Digital Humanities

"Libraries have numerous capabilities and considerable expertise available to accelerate digital humanities initiatives. The University of Michigan Library developed a model for effective partnership between libraries and digital humanities scholars; this model contributes to both a definition and redefinition of this emergent field. As the U-M experience shows, using the digital humanities as a key innovation tool can help libraries and their host institutions transform the way research, teaching, and learning are conceptualized. Several real-world examples illustrate the power of collaboration in providing win-win scenarios for both librarians and scholars in the advancement of scholarship."

This was an article mostly on "how we did good."

July 03, 2014 04:00 AM

Digital Scholarship in the Humanities and Creative Arts: The HuNI Virtual Laboratory

"One of the Australian national virtual laboratories, the Humanities Networked Infrastructure brings together data from 30 different data sets containing more than two million records of Australian heritage. HuNI maps the data to an overall data model and converts the data for inclusion in an aggregated store. HuNI is also assembling and adapting software tools for using and working with the aggregated data. Underlying HuNI is the recognition that cultural data is not economically, culturally, or socially insular, and researchers need to collaborate across disciplines, institutions, and social locations to explore it fully."

July 03, 2014 04:00 AM

Digital Collections As Research Infrastructure

"Given the importance of digital content to scholarship, institutions are increasingly developing strategic digitization programs to provide online access to both their reference collections and their unique and distinct materials. The internal digitization program at the National Library of Wales focuses on its collections and supports many projects, offering access to over 2,000,000 pages of historic Welsh newspapers, journals, and archives. Work on the program has yielded theoretical as well as practical results; among the former are the definition of five categories of digital content engagement: use it, share it, engage with it, enrich it, and sustain it. Using these categories as a guide can help ensure that programs add to their digital content's value, increase its impact, and ensure its maintenance as part of a shared digital research infrastructure."

July 03, 2014 04:00 AM

June 22, 2014

Mini-musings

Fun with ElasticSearch and MARC

For a good time I have started to investigate how to index MARC data using ElasticSearch. This posting outlines some of my initial investigations and hacks.

ElasticSearch seems to be an increasingly popular indexer. Getting it up and running on my Linux host was… trivial. It comes with a full-fledged Perl interface. Nice! Since ElasticSearch takes JSON as input, I needed to serialize my MARC data accordingly, and MARC::File::JSON seems to do a fine job. With this in hand, I wrote three programs:

  1. index.pl – create an index of MARC records
  2. get.pl – retrieve a specific record from the index
  3. search.pl – query the index

I have some work to do, obviously. First of all, do I really want to index MARC in its raw, communications format? I don’t think so, but that is where I’ll start. Second, the search script doesn’t really search. Instead it simply gets all the records. This is because I really don’t know how to search yet; I don’t really know how to query fields like “245 subfield a”.

index.pl

#!/usr/bin/perl

# configure
use constant INDEX => 'pamphlets';
use constant MARC  => './pamphlets.marc';
use constant MAX   => 100;
use constant TYPE  => 'marc';

# require
use JSON;
use MARC::Batch;
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $batch = MARC::Batch->new( 'USMARC', MARC );
my $count = 0;
my $e     = Search::Elasticsearch->new;

# process each record in the batch
while ( my $record = $batch->next ) {

  # debug
  print $record->title, "\n";
  
  # serialize the record into json
  my $json = &MARC::File::JSON::encode( $record );
  
  # increment
  $count++;
  
  # index; do the work; decode the json string into a perl data structure
  # so the client sends the record as a proper json object
  $e->index( index => INDEX,
             type  => TYPE,
             id    => $count,
             body  => decode_json( $json )
  );
    
  # check; only do a few
  last if ( $count > MAX );
  
}

# done
exit;

get.pl

# configure 
use constant INDEX => 'pamphlets';
use constant TYPE  => 'marc';

# require
use JSON;
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $e = Search::Elasticsearch->new;

# get; do the work
my $doc = $e->get( index   => INDEX,
                   type    => TYPE,
                   id      => $ARGV[ 0 ]
);

# reformat and output; done
# re-encode the _source data structure as json so MARC::Record can re-build the record
my $record = MARC::Record->new_from_json( encode_json( $doc->{ '_source' } ) );
print $record->as_formatted, "\n";
exit;

search.pl

# configure 
use constant INDEX => 'pamphlets';

# require
use JSON;
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $e = Search::Elasticsearch->new;

# search; match_all simply returns all the records, no query terms required
my $results = $e->search(
  index => INDEX,
  body  => { query => { match_all => {} } }
);

# output
my $hits = $results->{ 'hits' }->{ 'hits' };
for ( my $i = 0; $i <= $#$hits; $i++ ) {

  my $record = MARC::Record->new_from_json( encode_json( $$hits[ $i ]->{ '_source' } ) );
  print $record->as_formatted, "\n\n";

}

# done
exit;

by Eric Lease Morgan at June 22, 2014 03:40 PM

June 16, 2014

Life of a Librarian

Visualising Data: A Travelogue


Last month a number of us from the Hesburgh Libraries attended a day-long workshop on data visualisation facilitated by Andy Kirk of Visualising Data. This posting documents some of the things I learned.

First and foremost, we were told there are five steps to creating data visualisations. From the handouts and supplemented with my own understanding, they include:

  1. establishing purpose – This is where you ask yourself, “Why is a visualisation important here? What is the context of the visualisation?”
  2. acquiring, preparing and familiarising yourself with the data – Here different data types were echoed (open, nominal, ordinal, interval, and ratio), and we were introduced to the hidden costs of massaging and enhancing data, which is something I do with text mining and others do in statistical analysis.
  3. establishing editorial focus – This is about asking and answering questions regarding the visualisation’s audience. What is their education level? How much time will they have to absorb the content? What medium(s) may be best used for the message?
  4. conceiving the design – Using just paper and pencil, draw, brainstorm, and outline the appearance of the visualisation.
  5. constructing the visualisation – Finally, do the work of making the visualisation a reality. Increasingly this work is done by exploiting the functionality of computers, specifically for the Web.

Here are a few meaty quotes:

  • Context is king.
  • Data preparation is a hidden cost in visualization.
  • Data visualisation is a tool for understanding, not fancy ways of showing numbers.
  • Data visualisation is about analysis and communication.

One of my biggest take-aways was the juxtaposition of two spectra: reading to feeling, and explaining to exploring. In other words, to what degree is the visualisation expected to be read or felt, and to what degree is it offering the possibilities to explain or explore the data? Kirk illustrated the idea like this:

                read
                 ^
                 |
                 |
   explain <-----+-----> explore
                 |
                 |
                 v
                feel

The reading/feeling spectrum reminded me of the usability book entitled Don’t Make Me Think. The explaining/exploring spectrum made me consider interactivity in visualisations.

I learned two other things along the way: 1) creating visualisations is a team effort requiring a constellation of skilled people (graphic designers, statisticians, content specialists, computer technologists, etc.), and 2) it is entirely plausible to combine more than one graphic — data set illustration — into a single visualisation.

Now I just need to figure out how to put these visualisation techniques into practice.

by Eric Lease Morgan at June 16, 2014 07:05 PM

June 13, 2014

Life of a Librarian

ORCID Outreach Meeting (May 21 & 22, 2014)

This posting documents some of my experiences at the ORCID Outreach Meeting in Chicago (May 21 & 22, 2014).

As you may or may not know, ORCID is an acronym for “Open Researcher and Contributor ID”.* It is also the name of a non-profit organization whose purpose is to facilitate the creation and maintenance of identifiers for scholars, researchers, and academics. From ORCID’s mission statement:

ORCID aims to solve the name ambiguity problem in research and scholarly communications by creating a central registry of unique identifiers for individual researchers and an open and transparent linking mechanism between ORCID and other current researcher ID schemes. These identifiers, and the relationships among them, can be linked to the researcher’s output to enhance the scientific discovery process and to improve the efficiency of research funding and collaboration within the research community.

A few weeks ago the ORCID folks facilitated a user’s group meeting. It was attended by approximately 125 people (mostly librarians or people who work in/around libraries), and some of the attendees came from as far away as Japan. The purpose of the meeting was to build community and provide an opportunity to share experiences.

The meeting itself was divided into a number of panel discussions and a “codefest”. The panel discussions described successes (and failures) for creating, maintaining, enhancing, and integrating ORCID identifiers into workflows, institutional repositories, grant application processes, and information systems. Presenters described poster sessions, marketing materials, information sessions, computerized systems, policies, and politics all surrounding the implementation of ORCID identifiers. Quite frankly, nobody seemed to have a hugely successful story to tell because too few researchers seem to think there is a need for identifiers. I, as a librarian and information professional, understand the problem (as well as the solution), but outside the profession there may not seem to be much of a problem to be solved.

That said, the primary purpose of my attendance was to participate in the codefest. There were fewer than a dozen of us coders, and we all wanted to use the various ORCID APIs to create new and useful applications. I was most interested in the possibilities of exploiting the RDF output obtainable through content negotiation against an ORCID identifier, a la the command line application called curl:

curl -L -H "Accept: application/rdf+xml" http://orcid.org/0000-0002-9952-7800

Unfortunately, the RDF output only included the merest of FOAF-based information, and I was interested in bibliographic citations.

Consequently I shifted gears, took advantage of the ORCID-specific API, and I decided to do some text mining. Specifically, I wrote a Perl program — orcid.pl — that takes an ORCID identifier as input (ie. 0000-0002-9952-7800) and then:

  1. queries ORCID for all the works associated with the identifier**
  2. extracts the DOIs from the resulting XML
  3. feeds the DOIs to a program called Tika for the purposes of extracting the full text from documents
  4. concatenates the result into a single stream of text, and sends the whole thing to standard output

For example, the following command will create a “bag of words” containing the content of all the writings associated with my ORCID identifier that have DOIs:

$ ./orcid.pl 0000-0002-9952-7800 > morgan.txt
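A heavily hedged sketch of the general shape of such a script follows. The works URL is the public one mentioned in the footnote below, but the DOI extraction (a simple regular expression) and the Tika invocation (a local copy of tika-app.jar run against each resolved DOI) are assumptions of this sketch, not necessarily how orcid.pl itself does the work:

#!/usr/bin/perl

# orcid2text.pl - given an ORCID identifier, harvest the text of works having DOIs
# usage: ./orcid2text.pl 0000-0002-9952-7800 > bag-of-words.txt
# assumptions: DOIs are pulled out of the works XML with a regular expression, and
# a local copy of tika-app.jar is used to extract plain text from each resolved DOI

use strict;
use warnings;
use LWP::UserAgent;

my $orcid = $ARGV[ 0 ] or die "usage: $0 <orcid-id>\n";
my $tika  = './tika-app.jar';

# query ORCID for all the works associated with the identifier
my $ua    = LWP::UserAgent->new;
my $works = $ua->get( "http://pub.orcid.org/$orcid/orcid-works" )->decoded_content;

# extract anything that looks like a DOI
my %dois = map { $_ => 1 } ( $works =~ m{\b(10\.\d{4,}/[^\s<"]+)}g );

# resolve each DOI and let Tika do the work of extracting the text;
# concatenate the results into a single stream sent to standard output
foreach my $doi ( sort keys %dois ) {
  warn "processing $doi\n";
  print `java -jar "$tika" --text "http://dx.doi.org/$doi"`;
  print "\n";
}

exit;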

Using this program I proceeded to create a corpus of files based on the ORCID identifiers of eleven Outreach Meeting attendees. I then used my “tiny text mining tools” to do analysis against the corpus. The results were somewhat surprising:

  • The most significant key words shared across the corpus of eleven people included: information, system, site, and orcid.
  • The authors Haak and Paglione wrote the most similar articles. (They both wrote about ORCID.) Morgan and Havert were a very close second. (We both wrote about “information” and “sites”.)
  • The DOIs often point to splash pages, and consequently my “bags of words” included lots of content about cookies and publishers as opposed to meaty journal article content. ***

Ideally, the hack I wrote would allow a person to feed one or more identifiers to a system and output a report summarizing and analyzing the journal article content at a glance — a quick & easy “distant reading” tool.

I finished my “hack” in one sitting which gave me time to attend the presentations of the second day.

All of the hacks were added to a pile and judged by a vendor on their utility. I’m proud to say that Jeremy Friesen’s — a colleague here at Notre Dame — hack won a prize. His application followed the links to people’s publications, created a screen dump of the publications’ root pages, and made a montage of the result. It was a visual version of orcid.pl. Congratulations, Jeremy!

I’m very glad I attended the Meeting. I reconnected with a number of professional colleagues, and my awareness of researcher identifiers was increased. More specifically, there seem to be a growing number of these identifiers. Examples for myself include:

And for a really geeky good time, I learned to create the following set of RDF triples with the use of these identifiers:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
  <http://dx.doi.org/10.1108/07378831211213201> dc:creator
  "http://isni.org/isni/0000000035290715" ,
  "http://id.loc.gov/authorities/names/n94036700" ,
  "http://orcid.org/0000-0002-9952-7800" ,
  "http://viaf.org/viaf/26290254" ,
  "http://www.researcherid.com/rid/F-2062-2014" ,
  "http://www.scopus.com/authid/detail.url?authorId=25944695600" .

I learned about the (subtle) difference between an identifier and an authority control record. I learned of the advantages and disadvantages of the various identifiers. And through a number of serendipitous email exchanges, I learned about ISNIs, which are an ISO standard for identifiers, seemingly popular in Europe but relatively unknown here in the United States. For more detail, see the short discussion of these things in the Code4Lib mailing list archives.

Now might be a good time for some of my own grassroots efforts to promote the use of ORCID identifiers.

* Thanks, Pam Masamitsu!

** For a good time, try http://pub.orcid.org/0000-0002-9952-7800/orcid-works, or substitute your identifier to see a list of your publications.

*** The problem with splash screens is exactly what the very recent CrossRef Text And Data Mining API is designed to address.

by Eric Lease Morgan at June 13, 2014 03:04 PM

June 10, 2014

Life of a Librarian

CrossRef’s Text and Data Mining (TDM) API

A few weeks ago I learned that CrossRef’s Text And Data Mining (TDM) API had gone version 1.0, and this blog posting describes my tertiary experience with it.

A number of months ago I learned about Prospect, a fledgling API being developed by CrossRef. Its purpose was to facilitate direct access to full text journal content without going through the hassle of screen scraping journal article splash pages. Since then the API has been upgraded to version 1.0 and renamed the Text And Data Mining API. This is how the API is expected to be used:

  1. Given a (CrossRef) DOI, resolve the DOI using HTTP content negotiation. Specifically, request text/turtle output.
  2. From the response, capture the HTTP header called “Link”.
  3. Parse the links header to extract URIs denoting full text, licenses, and people.
  4. Make choices based on the values of the URIs.

What sorts of choices is one expected to make? Good question. First and foremost, a person is supposed to evaluate the license URI. If the URI points to a palatable license, then you may want to download the full text, which seems to come in PDF and/or XML flavors. With version 1.0 of the API, I have discovered ORCID identifiers are included in the header. I believe these denote authors/contributors of the articles.

Again, all of this is based on the content of the HTTP Link header. Here is an example header, with carriage returns added for readability:

<http://downloads.hindawi.com/journals/isrn.neurology/2013/908317.pdf>;
rel="http://id.crossref.org/schema/fulltext"; type="application/pdf"; version="vor",
<http://downloads.hindawi.com/journals/isrn.neurology/2013/908317.xml>;
rel="http://id.crossref.org/schema/fulltext"; type="application/xml"; version="vor",
<http://creativecommons.org/licenses/by/3.0/>; rel="http://id.crossref.org/schema/license";
version="vor", <http://orcid.org/0000-0002-8443-5196>; rel="http://id.crossref.org/schema/person",
<http://orcid.org/0000-0002-0987-9651>; rel="http://id.crossref.org/schema/person",
<http://orcid.org/0000-0003-4669-8769>; rel="http://id.crossref.org/schema/person"
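A minimal sketch of steps #1 through #3 might look like the following. It is a simplification of what extractor.pl does: it resolves a given DOI with content negotiation and then lists the URIs (and their rel attributes) found in the header, which LWP exposes under the standard name Link:

#!/usr/bin/perl

# links.pl - resolve a DOI via content negotiation and list the URIs in the Link header
# usage: ./links.pl <doi>
# this is a simplification; a real parser would also honor the type and version attributes

use strict;
use warnings;
use LWP::UserAgent;

my $doi = $ARGV[ 0 ] or die "usage: $0 <doi>\n";

# resolve the DOI, asking for turtle and following redirects along the way
my $ua       = LWP::UserAgent->new;
my $response = $ua->get( "http://dx.doi.org/$doi", 'Accept' => 'text/turtle' );
die $response->status_line, "\n" unless $response->is_success;

# the header is a comma-delimited list of <uri>; attribute=value pairs
my $links = $response->header( 'Link' ) or die "no Link header found\n";
foreach my $link ( split /,\s*/, $links ) {
  my ( $uri ) = ( $link =~ /<([^>]+)>/ );
  my ( $rel ) = ( $link =~ /rel="([^"]+)"/ );
  print defined $rel ? "$rel\t$uri\n" : "$uri\n";
}

exit;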

I wrote a tiny Perl library — extractor.pl — used to do steps #1 through #3, above. It returns a reference to a hash containing the values in the Link header. I then wrote three Perl scripts which exploit the library:

  1. resolver.cgi – a Web-based application taking a DOI as input and returning the URIs in the Link header, if they exist. Your mileage with the script will vary because most DOIs are not associated with full text URIs.
  2. search.cgi – given a simple query, use CrossRef’s Metadata API to find no more than five articles associated with full text content, and then resolve the links to the full text.
  3. search.pl – a command-line version of search.cgi

Here are a few comments. To me, as a person who increasingly wants direct access to full text articles, the Text And Data Mining API is a step in the right direction. Now all that needs to happen is for publishers to get on board and feed CrossRef the URIs of full text content along with the associated licensing terms. I found the Link header to be a bit convoluted, but this is what programming libraries are for. I could not find a comprehensive description of what name/value combinations can exist in the Link header. For example, the documentation alludes to beginning and ending dates. CrossRef seems to have a growing number of interesting applications and APIs which are probably going unnoticed, and there is an opportunity of some sort lurking in there. Specifically, somebody ought to do something with the text/turtle (RDF) output of the DOI resolutions.

‘More fun with HTTP and bibliographics.

by Eric Lease Morgan at June 10, 2014 07:09 PM

June 09, 2014

Readings

Ranking and extraction of relevant single words in text

Describes a technique for extracting the key (significant) words from a text

June 09, 2014 04:00 AM

Level statistics of words: Finding keywords in literary texts and symbolic sequences

"Using a generalization of the level statistics analysis of quantum disordered systems, we present an approach able to extract automatically keywords in literary texts. Our approach takes into account not only the frequencies of the words present in the text but also their spatial distribution along the text, and is based on the fact that relevant words are significantly clustered (i.e., they self-attract each other), while irrelevant words are distributed randomly in the text..."

June 09, 2014 04:00 AM

June 05, 2014

Readings

Corpus Stylistics, Stylometry, and the Styles of Henry James

"Stylometry provides powerful techniques for examining authorial style variation. This study uses several such techniques to explore the traditional distinction between James's early and late styles. They confirm this distinction, identify an intermediate style, and facilitate an analysis of the lexical character of James's style. Especially revealing are techniques that identify words with extremely variable frequencies across James's oeuvre-words that clearly characterize the various period styles. Such words disproportionately increase or decrease steadily throughout James's remarkably unidirectional stylistic development. Stylometric techniques constitute a promising avenue of research that exploits the power of corpus analysis and returns our attention to a manageable subset of an author's vocabulary."

I learned about various stylometric techniques such as Delta, and to some degree PCA.

June 05, 2014 04:00 AM

May 28, 2014

Readings

Narrative framing of consumer sentiment in online restaurant reviews

"The vast increase in online expressions of consumer sentiment offers a powerful new tool for studying consumer attitudes. To explore the narratives that consumers use to frame positive and negative sentiment online, we computationally investigate linguistic structure in 900,000 online restaurant reviews. Negative reviews, especially in expensive restaurants, were more likely to use features previously associated with narratives of trauma: negative emotional vocabulary, a focus on the past actions of third person actors such as waiters, and increased use of references to "we" and "us", suggesting that negative reviews function as a means of coping with service-related trauma. Positive reviews also employed framings contextualized by expense: inexpensive restaurant reviews use the language of addiction to frame the reviewer as craving fatty or starchy foods. Positive reviews of expensive restaurants were long narratives using long words emphasizing the reviewer's linguistic capital and also focusing on sensory pleasure. Our results demonstrate that portraying the self, whether as well-educated, as a victim, or even as addicted to chocolate, is a key function of reviews and suggests the important role of online reviews in exploring social psychological variables."

Very interesting use of lexicons. Bad restaurant reviews were associated with interpersonal interactions. Good reviews were associated with sensual pleasure.

  • Creator(s): Jurafsky, Dan; et al.
  • Date created: 2014-03-17
  • Date read: 2014-05-28
  • Facet/terms: Formats/Journal articles; Themes/Sentiment Analysis;
  • Rights: Open
  • Source: Narrative framing of consumer sentiment in online restaurant reviews by Dan Jurafsky, Victor Chahuneau, Bryan R. Routledge, and Noah A. Smith. First Monday, Volume 19, Number 4 - 7 April 2014 http://firstmonday.org/ojs/index.php/fm/article/view/4944/3863
  • Versions(s): original; local/annotated

May 28, 2014 04:00 AM

May 15, 2014

Life of a Librarian

Code4Lib jobs topic

This posting describes how to turn off and on a thing called the jobs topic in the Code4Lib mailing list.

Code4Lib is a mailing list whose primary focus is computers and libraries. Since its inception in 2004, it has grown to include about 2,800 members from all around the world but mostly from the United States. The Code4Lib community has also spawned an annual conference, a refereed online journal, its own domain, and a growing number of regional “franchises”.

The Code4Lib community has also spawned job postings. Sometimes these job postings flood the mailing list, and while it is entirely possible to use mail filters to exclude such postings, there is also “more than one way to skin a cat”. Since the mailing list uses the LISTSERV software, the mailing list has been configured to support the idea of “topics”, and through this feature a person can configure their subscription preferences to exclude job postings. Here’s how. By default every subscriber to the mailing list will get all postings. If you want to turn off getting the jobs postings, then email the following command to listserv@listserv.nd.edu:

SET code4lib TOPICS: -JOBS

If you want to turn on the jobs topic and receive the notices, then email the following command to listserv@listserv.nd.edu:

SET code4lib TOPICS: +JOBS

Sorry, but if you subscribe to the mailing list in digest mode, then the topics command has no effect; you will get the job postings no matter what.

HTH.

Special thanks go to Jodi Schneider and Joe Hourcle who pointed me in the direction of this LISTSERV functionality. Thank you!

by Eric Lease Morgan at May 15, 2014 03:59 PM

Date created: 2000-05-19
Date updated: 2011-05-03
URL: http://infomotions.com/