April 23, 2014

LiAM: Linked Archival Metadata

Trends and gaps in linked data for archives

“A funny thing happened on the way to the forum.”

Two travelogues

Two recent professional meetings have taught me that, when creating some sort of information service, linked data will reside alongside and be mixed with data collected from any number of Internet sites. Linked data interfaces will coexist with REST-ful interfaces, or even things as rudimentary as FTP. To the archivist, this means linked data is not the be-all and end-all of information publishing. There is no such thing. To the application programmer, this means you will need to have experience with an ever-growing number of Internet protocols. To both it means, “There is more than one way to get there.”

In October of 2013 I had the opportunity to attend the Semantic Web In Libraries conference. It was a three-day event attended by approximately three hundred people who could roughly be divided into two equally sized groups: computer scientists and cultural heritage institution employees. The bulk of the presentations fell into two categories: 1) publishing linked data, and 2) creating information services. The publishers talked about ontologies, human-computer interfaces for data creation/maintenance, and systems exposing RDF to the wider world. The people creating information services were invariably collecting, homogenizing, and adding value to data gathered from a diverse set of information services. These information services were not limited to sets of linked data. They also included services accessible via REST-ful computing techniques and OAI-PMH interfaces, and there were probably a few locally developed file transfers or relational database dumps described as well. These people were creating lists of information services, regularly harvesting content from the services, writing cross-walks, locally storing the content, indexing it, providing services against the result, and sometimes republishing any number of “stories” based on the data. For the second group of people, linked data was certainly not the only game in town.

In February of 2014 I had the opportunity to attend a hackathon called GLAM Hack Philly. A wide variety of data sets were presented for “hacking” against. Some were TEI files describing Icelandic manuscripts. Some were linked data published from the British Museum. Some were XML describing digitized journals created by a vendor-based application. Some resided in proprietary database applications describing the location of houses in Philadelphia. Some had little or no computer-readable structure and described plants. Some were the wiki mark-up for local municipalities. After the attendees (there were about two dozen of us) learned about each of the data sets, we self-selected and hacked away at projects of our own design. The results fell into roughly three categories: geo-referencing objects, creating searchable/browsable interfaces, and data enhancement. With the exception of the hack repurposing journal content into visual art, the results were pretty typical for cultural heritage institutions. But what fascinated me was the way we hackers selected our data sets. Namely, the more complete and well-structured the data, the more hackers gravitated towards it. Of all the data sets, the TEI files were the most complete, accurate, and computer-readable. Three or four projects were done against the TEI. (Heck, I even hacked on the TEI files.) The linked data from the British Museum, very well structured but not quite as thorough as the TEI, attracted a large number of hackers who worked together for a common goal. All the other data sets had only one or two people working on them. What is the moral to the story? There are two of them. First, archivists, if you want people to process your data and do “kewl” things against it, then make sure the data is thorough, complete, and computer-readable. Second, computer programmers, you will need to know a variety of data formats. Linked data is not the only game in town.

The technologies described in this Guidebook are not the only way to accomplish the goals of archivists wishing to make their content more accessible. Instead, linked data is just one of many protocols in the toolbox. It is open, standards-based, and simpler rather than more complex. On the other hand, other protocols exist which have a different set of strengths and weaknesses. Computer technologists will need to have a broader rather than narrower knowledge of various Internet tools. For archivists, the core of the problem is still the collection and description of content. This, the what of archival practice, continues to remain constant. It is the how of archival practice, the technology, that changes at a much faster pace.

With great interest I read the Spring/Summer issue of Information Standards Quarterly, entitled “Linked Data in Libraries, Archives, and Museums”, which contained a number of articles pertaining to linked data in cultural heritage institutions. Of particular interest to me were the loosely enumerated challenges of linked data. Some of them included:

  • the apparent Tower Of Babel when it comes to vocabularies used to describe content, and at the same time the need for “ontology mindfulness”
  • dirty, inconsistent data, and wide variations in data integrity
  • persistent URIs
  • the “chicken & egg” problem of why publish linked data if there is no killer application

There are a number of challenges in the linked data process itself. Some of them are listed below, and some of them have been alluded to previously.

  • Create useful linked data, meaning, create linked data that links to other linked data. Linked data does not live in a world by itself. Remember, the “l” stands for “linked”. For example, try to include URIs that are also used in other linked data sets. Sometimes this is not possible, as with the names of people in archival materials; when possible, projects have used VIAF, but at other times they have needed to mint their own URIs denoting individuals.
  • There is a level of rigor involved in creating the data model, and there may be many discussions regarding semantics. For example, what is a creator? When is a term intended to be an index term as opposed to a reference? When does one term in one vocabulary equal a different term in a different vocabulary?
  • Balance the creation of your own vocabulary against the need to speak the language of others by using their vocabularies.
  • Consider “fixing” the data as it comes in or goes out, because it may be neither consistent nor thorough.
  • Provenance is an issue. People, especially scholars, will want to know where the linked data came from and whether or not it is authoritative. How to solve or address this problem? The jury is still out on this one.
  • Creating and maintaining linked data is difficult because it requires the skills of a number of different types of people: computer programmers, database designers, subject experts, metadata specialists, archivists, etc. A team is all but necessary.

Linked data represents a modern way of making your archival descriptions accessible to the wider world. In that light, it represents a different way of doing things but not necessarily a different what of doing things. You will still be doing inventory. You will still be curating collections. You will still be prioritizing what goes and what stays.

Gaps

Linked data makes a lot of sense, but there are some personnel and technological gaps needing to be filled before it can really and truly be widely adopted by archives (or libraries or museums). They include but are not limited to: hands-on training, “string2URI” tools, database to RDF interfaces, mass RDF editors, and maybe “killer applications”.

Hands-on training

Different people learn in different ways, and hands-on training on what linked data is and how it can be put into practice would go a long way towards the adoption of linked data in archives. These hands-on sessions could be as short as an hour or as long as one or two days. They would include a mixture of conceptual and technological topics. For example, there could be a tutorial on how to search RDF triple stores using SPARQL. Another tutorial would compare & contrast the data models of databases with the RDF data model. A class could be facilitated on how to transform XML files (MARCXML, MODS, EAD) to any number of RDF serializations and publish the result on a Web server. There could be a class on how to design URIs. A class on how to literally draw an RDF ontology would be a good idea. Another class would instruct people on how to formally read & write an ontology using OWL. Yet another hands-on workshop would demonstrate to participants the techniques for creating, maintaining, and publishing an RDF triple store. Etc. Linked data might be a “good thing”, but people are going to need to learn how to work more directly with it. These hands-on trainings could be aligned with hack-a-thons, hack-fests, or THATCamps so a mixture of archivists, metadata specialists, and computer programmers would be in the same spaces at the same times.
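To give a flavor of what such a SPARQL tutorial might cover, here is a minimal sketch in Python using the rdflib library: load a small set of RDF and ask it a question. The file name (descriptions.rdf) and the Dublin Core properties queried are illustrative assumptions, not prescriptions.

# A minimal SPARQL sketch: load some RDF and query it.
# The file name, serialization, and vocabulary are illustrative assumptions.
from rdflib import Graph

graph = Graph()
graph.parse("descriptions.rdf", format="xml")

# Find resources that carry a Dublin Core title and creator.
query = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?resource ?title ?creator
WHERE {
    ?resource dc:title ?title ;
              dc:creator ?creator .
}
LIMIT 10
"""

for resource, title, creator in graph.query(query):
    print(resource, title, creator)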

string2URI

There is a need for tools enabling people and computers to automatically associate string literals with URIs. If nobody (or relatively few people) share URIs across their published linked data, then the promises of linked data won’t come to fruition. Archivists (and librarians and people who work in museums) take things like controlled vocabularies and name authority lists very seriously. Identifying the “best” URI for a given thing, subject term, or personal name is something the profession is going to want to do and do well.

Fabian Steeg and Pascal Christoph at the 2013 Semantic Web in Libraries conference asked the question, “How can we benefit from linked data without being linked data experts?” Their solution was the creation of a set of tools enabling people to query a remote service and get back a list of URIs which were automatically inserted into a text. This is an example of a “string2URI” tool that needs to be written and widely adopted. These tools could be as simple as a one-box, one-button interface where a person enters a word or phrase and one or more URIs are returned for selection. A slightly more complicated version would include a drop-down menu allowing the person to select places to query for the URI. Another application suggested by quite a number of people would use natural language processing to first extract named entities (people, places, things, etc.) from texts (like abstracts, scope notes, biographical histories, etc.). Once these entities were extracted, they would then be fed to string2URI. The LC Linked Data Service, VIAF, and Worldcat are very good examples of string2URI tools. The profession needs more of them. SNAC’s use of EAC-CPF is something to watch in this space.
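As a sketch of how such a tool might behave, the following few lines of Python query VIAF’s AutoSuggest service for a personal name and return candidate URIs. The endpoint URL and the shape of the JSON response are assumptions on my part; verify them against VIAF’s current documentation before building anything real on top of this.

# A hypothetical string2URI helper: given a name, return candidate VIAF URIs.
# The AutoSuggest endpoint and its JSON response shape are assumptions.
import json
import urllib.parse
import urllib.request

def string2uri(name):
    url = "http://viaf.org/viaf/AutoSuggest?query=" + urllib.parse.quote(name)
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    suggestions = data.get("result") or []
    return ["http://viaf.org/viaf/" + item["viafid"]
            for item in suggestions if "viafid" in item]

if __name__ == "__main__":
    for uri in string2uri("Twain, Mark"):
        print(uri)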

Database to RDF publishing systems

There are distinct advantages and disadvantages to the current ways of creating and maintaining the descriptions of archival collections. They fill a particular purpose and function. Nobody is going to suddenly abandon well-known techniques for ones seemingly unproven. Consequently, there is a need to easily migrate existing data to RDF. One way towards this goal is to transform or export archival descriptions from their current containers to RDF. D2RQ could go a long way towards publishing the underlying databases of PastPerfect, Archon, Archivist’s Toolkit, or ArchivesSpace as RDF. A seemingly little used database to RDF modeling language — R2RML — could be used for similar purposes. These particular solutions are rather generic. Either a great deal of customization needs to be done using D2RQ, or new interfaces to the underlying databases need to be created. Regarding the latter, this will require a large amount of specialized work. An ontology & vocabulary would need to be designed or selected. The data and the structure of the underlying databases would need to be closely examined. A programmer would need to write reports against the database to export RDF and publish it in one form or another. Be forewarned. Software, like archival description, is never done. On the other hand, this sort of work could be done once, shared with the wider archival community, and then applied to local implementations of Archivist’s Toolkit, ArchivesSpace, etc.

Mass RDF editors

Archivists curate physical collections as well as descriptions of those collections. Ideally, the descriptions would reside in a triple store as if it were a database. The store would be indexed. Queries could be applied against the store. Create, read, update, and delete operations could be easily done. As RDF is amassed it will almost definitely need to be massaged and improved. URIs may need to be equated. Controlled vocabulary terms may need to be related. Supplementary statements may need to be asserted enhancing the overall value of the store. String literals may need to be normalized or even added. This work will not be done on a one-by-one, statement-by-statement basis. There are simply too many triples — hundreds of thousands, if not millions of them. Some sort of mass RDF editor will need to be created. If the store were well managed, and if a person were well-versed in SPARQL, then much of this work could be done through SPARQL statements. But SPARQL is not for the faint of heart, and despite what some people say, it is not easy to write. Tools will need to be created — akin to the tools described by Diane Hillman and articulated through the experience with the National Science Foundation Digital Library — making it easy to do large-scale additions and updates to RDF triple stores.
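To make the idea concrete, here is a minimal sketch, in Python with rdflib (version 4 or later), of the sort of bulk change such an editor would automate: asserting an owl:sameAs link and normalizing a literal everywhere it occurs, each with a single SPARQL UPDATE instead of statement-by-statement editing. The file names, URIs, and predicates are illustrative assumptions.

# A minimal "mass edit" sketch applied to a local graph with SPARQL UPDATE.
# File names, URIs, and predicates are illustrative assumptions.
from rdflib import Graph

graph = Graph()
graph.parse("store.ttl", format="turtle")

# 1. Equate a locally minted URI with an externally published one.
graph.update("""
PREFIX owl: <http://www.w3.org/2002/07/owl#>
INSERT DATA {
    <http://archive.example.org/agent/twain>
        owl:sameAs <http://viaf.org/viaf/0000000000> .
}
""")

# 2. Normalize a creator literal everywhere it appears.
graph.update("""
PREFIX dc: <http://purl.org/dc/elements/1.1/>
DELETE { ?s dc:creator "Twain, M." }
INSERT { ?s dc:creator "Twain, Mark" }
WHERE  { ?s dc:creator "Twain, M." }
""")

graph.serialize(destination="store-updated.ttl", format="turtle")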

“Killer” applications

To some degree, the idea of the Semantic Web and linked data has been oversold. We were told, “Make very large sets of RDF freely available and new relationships between resources will be discovered.” The whole thing smacks of artificial intelligence which simultaneously scares people and makes them laugh out loud. On the other hand, a close reading of Allemang’s and Hendler’s book Semantic Web For The Working Ontologist describes exactly how and why these new relationships can be discovered, but these discoveries do take some work and a significant volume of RDF from a diverse set of domains.

So maybe the “killer” application is not so much a sophisticated super-brained inference engine but something less sublime. A number of examples come to mind. Begin by recreating the traditional finding aid. Transform an EAD file into serialized RDF, and from the RDF create a finding aid. This is rather mundane and redundant, but it will demonstrate and support a service model going forward. Then go three steps further. First, create a finding aid, but supplement it with data and information from your other finding aids. Your collections do not exist in silos nor in isolation. Second, supplement these second-generation finding aids with images, geographic coordinates, links to scholarly articles, or more thorough textual notes, all discovered from outside linked data sources. Third, establish relationships between your archival collections and the archival collections outside your institution. Again, relationships could be created between collections and between items in the collections. These sorts of “killer” applications enable the archivist to stretch the definition of the finding aid.

Another “killer” application may be a sort of union catalog. Each example of the union catalog will have some sort of common domain. The domain could be topical: Catholic studies, civil war history, the papers of a particular individual or organization. Collect the RDF from archives of these domains, put it into a single triple store, clean up and enhance the RDF, index it, and provide a search engine against the index. The domain could be regional. For example, the RDF from an archive, library, and museum of an individual college or university could be created, amalgamated, and presented. The domain could be professional: all archives, all libraries, or all museums.
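A minimal sketch of this sort of aggregation, in Python with rdflib, might look like the following. The source URLs are purely hypothetical placeholders for the RDF published by the participating archives.

# A minimal union catalog sketch: harvest RDF from several archives,
# pour it into one graph, and serialize the result for cleaning and indexing.
# The source URLs are hypothetical placeholders.
from rdflib import Graph

sources = [
    "http://archive-one.example.org/descriptions.rdf",
    "http://archive-two.example.org/descriptions.ttl",
]

union = Graph()
for source in sources:
    union.parse(source)   # rdflib infers the serialization from the response

print(len(union), "triples collected")
union.serialize(destination="union-catalog.ttl", format="turtle")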

Another killer application, especially in an academic environment, would be the integration of archival description into course management systems. Manifest archival descriptions as RDF. Manifest course offerings across the academy in the form of RDF. Manifest student and instructor information as RDF. Discover and provide links between archival content and people in specific classes. This sort of application would make archival collections much more relevant to the local population.

Tell stories. Don’t just provide links. People want answers as much as they want lists of references. After search queries are sent to indexes, provide search results in the form of lists of links, but also mash together information from the search results into a “named graph” that includes an overview of the apparent subject queried, images of the subject, a depiction of where the subject is located, and a few very relevant and well-respected links to narrative descriptions of the subject. You can see these sorts of enhancements in many Google and Facebook search results.

Support the work of digital humanists. Amass RDF. Clean, normalize, and enhance it. Provide access to it via searchable and browsable interfaces. Provide additional services against the results, such as timelines built from the underlying dates found in the RDF. Create word clouds based on statistically significant entities such as names of people or places or themes. Provide access to search results in the form of delimited files so the data can be imported into other tools for more sophisticated analysis. For example, support a search-results-to-Omeka interface. For that matter, create an Omeka-to-RDF service.

The “killer” application for linked data is only as far away as your imagination. If you can articulate it, then it can probably be created.

Last word

Linked data changes the way your descriptions get expressed and distributed. It is a lot like taking a trip across the country. The goal has always been to get to the coast to see the ocean, but instead of walking, going by stage coach, taking a train, or driving a car, you will be flying. Along the way you may visit a few cities and have a few layovers. Bad weather may even get in the way, but sooner or later you will get to your destination. Take a deep breath. Understand that the process will be one of learning, and that learning will be applicable in other aspects of your work. The result will be two-fold: first, a greater number of people will have access to your collections, and second, more people will be using them.

by Eric Lease Morgan at April 23, 2014 07:25 PM

April 21, 2014

LiAM: Linked Archival Metadata

LiAM Guidebook: Executive summary

Linked data is a process for embedding the descriptive information of archives into the very fabric of the Web. By transforming archival description into linked data, an archivist will enable other people as well as computers to read and use their archival description, even if the others are not a part of the archival community. The process goes both ways. Linked data also empowers archivists to use and incorporate the information of other linked data providers into their local description. This enables archivists to make their descriptions more thorough, more complete, and more value-added. For example, archival collections could be automatically supplemented with geographic coordinates in order to make maps, with images of people or additional biographical descriptions to make collections come alive, or with bibliographies for further reading.

Publishing and using linked data does not represent a change in the definition of archival description, but it does represent an evolution of how archival description is accomplished. For example, linked data is not about generating a document such as an EAD file. Instead it is about asserting sets of statements about an archival thing, and then allowing those statements to be brought together in any number of ways for any number of purposes. A finding aid is one such purpose. Indexing is another purpose. Use by a digital humanist is yet another. While EAD files are encoded as XML documents and therefore very computer readable, the reader must know the structure of EAD in order to make the most out of the data. EAD is archives-centric. The way data is manifested in linked data is domain-agnostic.

The objectives of archives include collection, organization, preservation, description, and oftentimes access to unique materials. Linked data is about description and access. By taking advantage of linked data principles, archives will be able to improve their descriptions and increase access. This will require a shift in the way things get done but not in what gets done. The goal remains the same.

Many tools already exist for transforming data in existing formats into linked data. This data can reside in Excel spreadsheets, database applications, MARC records, or EAD files. There are tiers of linked data publishing, so one does not have to do everything all at once. But to transform existing information or to maintain information over the long haul requires the skills of many people: archivists & content specialists, administrators & managers, metadata specialists & catalogers, computer programmers & systems administrators.

Moving forward with linked data is a lot like traveling to Rome. There are many ways to get there, and there are many things to do once you arrive, but the result will undoubtedly improve your ability to participate in the discussion of the human condition on a worldwide scale.

by Eric Lease Morgan at April 21, 2014 08:59 PM

April 18, 2014

LiAM: Linked Archival Metadata

Rome in three days, an archivist’s introduction to linked data publishing

If you go to Rome for a few days, do everything you would do in a single day, eat and drink in a few cafes, see a few fountains, and go to a museum of your choice.

Linked data in archival practice is not new. Others have been here previously. You can benefit from their experience and begin publishing linked data right now using tools with which you are probably already familiar. For example, you probably have EAD files, sets of MARC records, or metadata saved in database applications. Using existing tools, you can transform this content into RDF and put the result on the Web, thus publishing your information as linked data.

EAD

If you have used EAD to describe your collections, then you can easily make your descriptions available as valid linked data, but the result will be less than optimal. This is true not for a lack of technology but rather because of the inherent purpose and structure of EAD files.

A few years ago an organization in the United Kingdom called the Archives Hub was funded by a granting agency called JISC to explore the publishing of archival descriptions as linked data. The project was called LOCAH. One of the outcomes of this effort was the creation of an XSL stylesheet (ead2rdf) transforming EAD into RDF/XML. The terms used in the stylesheet originate from quite a number of standardized, widely accepted ontologies, and with only the tiniest bit of configuration/customization the stylesheet can transform a generic EAD file into valid RDF/XML for use by anybody. The resulting XML files can then be made available on a Web server or incorporated into a triple store. This goes a long way towards publishing archival descriptions as linked data. The only additional things needed are a transformation of EAD into HTML and the configuration of a Web server to do content negotiation between the XML and HTML.
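In practice, applying such a stylesheet can be as simple as the following Python sketch using lxml. The file names are assumptions, the stylesheet itself must be obtained from the LOCAH project, and if it relies on XSLT 2.0 features a processor such as Saxon would be needed in place of lxml.

# A minimal sketch of transforming one EAD file into RDF/XML with XSLT.
# Assumes ead2rdf.xsl has been downloaded locally; file names are illustrative.
from lxml import etree

stylesheet = etree.XSLT(etree.parse("ead2rdf.xsl"))
ead = etree.parse("collection-123.xml")

rdf = stylesheet(ead)
with open("collection-123.rdf", "wb") as handle:
    handle.write(etree.tostring(rdf, pretty_print=True,
                                xml_declaration=True, encoding="UTF-8"))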

For the smaller archive with only a few hundred EAD files whose content does not change very quickly, this is a simple, feasible, and practical solution to publishing archival descriptions as linked data. With the exception of doing some content negotiation, this solution does not require any computer technology that is not already being used in archives, and it only requires a few small tweaks to a given workflow:

  1. implement a content negotiation solution (a minimal sketch appears after this list)

  2. create and maintain EAD files
  3. transform EAD into RDF/XML

  4. transform EAD into HTML

  5. save the resulting XML and HTML files on a Web server

  6. go to step #2
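Regarding step #1, content negotiation simply means inspecting the HTTP Accept header and returning RDF/XML or HTML accordingly. Here is a minimal Python sketch of the idea using only the standard library; in production the same effect would more likely be achieved with a few lines of Apache or nginx configuration, and the file names below are illustrative.

# A minimal content negotiation sketch: serve RDF/XML to RDF-aware agents
# and HTML to everything else. File names are illustrative placeholders.
from wsgiref.simple_server import make_server

def application(environ, start_response):
    accept = environ.get("HTTP_ACCEPT", "")
    if "application/rdf+xml" in accept:
        path, mime = "collection-123.rdf", "application/rdf+xml"
    else:
        path, mime = "collection-123.html", "text/html"
    with open(path, "rb") as handle:
        body = handle.read()
    start_response("200 OK", [("Content-Type", mime), ("Vary", "Accept")])
    return [body]

if __name__ == "__main__":
    make_server("", 8000, application).serve_forever()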

EAD is a combination of narrative description and a hierarchical inventory list, and this data structure does not lend itself very well to the triples of linked data. For example, EAD headers are full of controlled vocabulary terms, but there is no way to link these terms with specific inventory items. This is because the vocabulary terms are expected to describe the collection as a whole, not individual things. This problem could be overcome if each individual component of the EAD were associated with controlled vocabulary terms, but this would significantly increase the amount of work needed to create the EAD files in the first place.

The common practice of using literals to denote the names of people, places, and things in EAD files would also need to be changed in order to fully realize the vision of linked data. Specifically, it would be necessary for archivists to supplement their EAD files with commonly used URIs denoting subject headings and named authorities. These URIs could be inserted into id attributes throughout an EAD file, and the resulting RDF would be more linkable, but the labor to do so would increase, especially since many of the named items will not exist in standardized authority lists.

Despite these shortcomings, transforming EAD files into some sort of serialized RDF goes a long way towards publishing archival descriptions as linked data. This particular process is a good beginning and outputs valid information, just information that is not as linkable as possible. This process lends itself to iterative improvements, and outputting something is better than outputting nothing. But this particular process is not for everybody. The archive whose content changes quickly, the archive with copious numbers of collections, or the archive wishing to publish the most complete linked data possible will probably not want to use EAD files as the root of their publishing system. Instead, some sort of database application is probably the best solution.

MARC

In some ways MARC lends itself very well to being published via linked data, but in the long run it is not really a feasible data structure.

Converting MARC into serialized RDF through XSLT is at least a two-step process. The first step is to convert MARC into MARCXML and then MARCXML into MODS. This can be done with any number of scripting languages and toolboxes. The second step is to use a stylesheet, such as the one created by Stefano Mazzocchi (mods2rdf.xsl), to transform the MODS into RDF/XML. From there a person could save the resulting XML files on a Web server, enhance access via content negotiation, and call it linked data.
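The first of those steps can be scripted with a library such as pymarc, as in the Python sketch below; the file names are assumptions, and the MODS and RDF/XML transformations would then follow with an XSLT processor as described above.

# A minimal sketch of step one: convert binary MARC records into MARCXML.
# File names are illustrative; the MARCXML would next be transformed to MODS
# and on to RDF/XML with XSLT stylesheets.
from pymarc import MARCReader, XMLWriter

with open("records.mrc", "rb") as marc_file:
    writer = XMLWriter(open("records.xml", "wb"))
    for record in MARCReader(marc_file):
        writer.write(record)
    writer.close()   # closes the <collection> wrapper and the file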

Unfortunately, this particular approach has a number of drawbacks. First and foremost, the MARC format has no place to denote URIs; MARC records are made up almost entirely of literals. Sure, URIs can be constructed from various control numbers, but things like authors, titles, subject headings, and added entries will most certainly be literals (“Mark Twain”, “Adventures of Huckleberry Finn”, “Bildungsroman”, or “Samuel Clemens”), not URIs. This issue can be overcome if the MARCXML were first converted into MODS and URIs were inserted into the id or xlink attributes of bibliographic elements, but this is extra work. If an archive were to take this approach, then it would also behoove them to use MODS as their data structure of choice, not MARC. Continually converting from MARC to MARCXML to MODS would be expensive in terms of time. Moreover, with each new conversion the URIs from previous iterations would need to be re-created.

EAC-CPF

Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF) goes a long way towards implementing a named authority database that could be linked from archival descriptions. These XML files could easily be transformed into serialized RDF and therefore linked data. The resulting URIs could then be incorporated into archival descriptions, making the descriptions richer and more complete. For example, the FindAndConnect site in Australia uses EAC-CPF under the hood to disseminate information about people in its collection. Similarly, “SNAC aims to not only make the [EAC-CPF] records more easily discovered and accessed but also, and at the same time, build an unprecedented resource that provides access to the socio-historical contexts (which includes people, families, and corporate bodies) in which the records were created”. More than a thousand EAC-CPF records are available from the RAMP project.

METS, MODS, OAI-PMH service providers, and perhaps more

If you have archival descriptions in either the METS or MODS format, then transforming them into RDF is as far away as your XSLT processor and a content negotiation implementation. As of this writing there do not seem to be any METS to RDF stylesheets, but there are a couple of stylesheets for MODS. The biggest issue with these sorts of implementations is the URIs. It will be necessary for archivists to include URIs in as many MODS id or xlink attributes as possible. The same thing holds true for METS files, except the id attribute is not designed to hold pointers to external sites.

Some archives and libraries use a content management system called ContentDM. Whether they know it or not, ContentDM comes complete with an OAI-PMH (Open Archives Initiative – Protocol for Metadata Harvesting) interface. This means you can send a REST-ful URL to ContentDM, and you will get back an XML stream of metadata describing digital objects. Some of the digital objects in ContentDM (or any other OAI-PMH service provider) may be something worth exposing as linked data, and this can easily be done with a system called oai2lod. It is a particular implementation of D2RQ, described below, and works quite well. Download the application, feed oai2lod the “home page” of the OAI-PMH service provider, and oai2lod will publish the OAI-PMH metadata as linked open data. This is another quick & dirty way to get started with linked data.
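For the curious, the OAI-PMH request itself is nothing more than a URL with a verb parameter, as in this small Python sketch. The repository base URL is a placeholder, and a tool like oai2lod handles the conversion to RDF from there.

# A minimal OAI-PMH request: ask a repository for Dublin Core records.
# The base URL is a hypothetical placeholder for any OAI-PMH service provider.
import urllib.parse
import urllib.request

base_url = "http://repository.example.org/oai"
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

with urllib.request.urlopen(base_url + "?" + urllib.parse.urlencode(params)) as response:
    xml = response.read().decode("utf-8")

print(xml[:500])   # the beginning of the XML stream of metadata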

Databases

Publishing linked data through XML transformation is functional but not optimal. Publishing linked data from a database comes closer to the ideal but requires a greater amount of technical computer infrastructure and expertise.

Databases — specifically, relational databases — are the current best practice for organizing data. As you may or may not know, relational databases are made up of many tables of data joined together with keys. For example, a book may be assigned a unique identifier. The book has many characteristics such as a title, number of pages, size, descriptive note, etc. Some of the characteristics are shared by other books, like authors and subjects. In a relational database these shared characteristics would be saved in additional tables, and they would be joined to a specific book through the use of unique identifiers (keys). Given this sort of data structure, reports can be created from the database describing its content. Similarly, queries can be applied against the database to uncover relationships that may not be apparent at first glance or that are buried in reports. The power of relational databases lies in the use of keys to make relationships between rows in one table and rows in other tables. The downside of relational databases as a data model is the infinite variety of field/table combinations, making them difficult to share across the Web.

Not coincidentally, relational database technology is very much the way linked data is expected to be implemented. In the linked data world, the subjects of triples are URIs (think database keys). Each URI is associated with one or more predicates (think the characteristics in the book example). Each triple then has an object, and these objects take the form of literals or other URIs. In the book example, the object could be “Adventures Of Huckleberry Finn” or a URI pointing to Mark Twain. The reports of relational databases are analogous to RDF serializations, and SQL (the relational database query language) is analogous to SPARQL, the query language of RDF triple stores. Because of the close similarity between well-designed relational databases and linked data principles, the publishing of linked data directly from relational databases makes a whole lot of sense, but the process requires the combined time and skills of a number of different people: content specialists, database designers, and computer programmers. Consequently, the process of publishing linked data from relational databases may be optimal, but it is more expensive.
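The parallel can be made concrete with a few lines of Python: read rows from a relational database and restate them as triples. The table layout (a books table with id, title, and author columns), the database file, and the URI pattern below are invented purely for illustration.

# A minimal sketch of restating relational rows as RDF triples.
# The schema, database file, and URI pattern are invented for illustration.
import sqlite3
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DC

BOOK = Namespace("http://archive.example.org/book/")

connection = sqlite3.connect("catalog.db")
graph = Graph()

for book_id, title, author in connection.execute(
        "SELECT id, title, author FROM books"):
    subject = BOOK[str(book_id)]          # the database key becomes a URI
    graph.add((subject, DC.title, Literal(title)))
    graph.add((subject, DC.creator, Literal(author)))

graph.serialize(destination="books.ttl", format="turtle")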

Thankfully, many archivists probably use some sort of behind-the-scenes database to manage their collections and create their finding aids. Moreover, archivists probably use one of three or four tools for this purpose: Archivist’s Toolkit, Archon, ArchivesSpace, or PastPerfect. Each of these systems has a relational database at its heart. Reports could be written against the underlying databases to generate serialized RDF and thus begin the process of publishing linked data. Doing this from scratch would be difficult as well as inefficient, because many people would be starting out with the same database structure but creating a multitude of varying outputs. Consequently, there are two alternatives. The first is to use a generic database-application-to-RDF publishing platform called D2RQ. The second is for the community to join together and create a holistic RDF publishing system based on the database(s) used in archives.

D2RQ is a very powerful software system. It is supported, well-documented, executable on just about any computing platform, open source, focused, functional, and at the same time does not try to be all things to all people. Using D2RQ it is more than possible to quickly and easily publish a well-designed relational database as RDF. The process is relatively simple:

  • download the software

  • use a command-line utility to map the database structure to a configuration file

  • edit the configuration file to meet your needs

  • run the D2RQ server using the configuration file as input thus allowing people or RDF user-agents to search and browse the database using linked data principles

  • alternatively, dump the contents of the database to an RDF serialization and ingest the result into your favorite RDF triple store

The downside of D2RQ is its generic nature. It will create an RDF ontology whose terms correspond to the names of database fields. These field names do not map to widely accepted ontologies & vocabularies and therefore will not interact well with communities outside the ones using a specific database structure. Still, the use of D2RQ is quick, easy, and accurate.

If you are going to be in Rome for only a few days, you will want to see the major sites, and you will want to venture out & about a bit, but at the same time it will be a wise idea to follow the lead of somebody who has been there previously. Take the advice of these people. It is an efficient way to see some of the sights.

by Eric Lease Morgan at April 18, 2014 12:43 AM

April 17, 2014

LiAM: Linked Archival Metadata

Rome in a day, the archivist on a linked data pilgrimage way

If you go to Rome for a day, then walk to the Colosseum and Vatican City. Everything you see along the way will be extra.

Linked data is not a fad. It is not a trend. It makes a lot of computing sense, and it is a modern way of fulfilling some of the goals of archival practice. Just like Rome, it is not going away. An understanding of what linked data has to offer is akin to experiencing Rome first hand. Both will ultimately broaden your perspective. Consequently it is a good idea to make a concerted effort to learn about linked data, as well as to visit Rome at least once. Once you have returned from your trip, discuss what you learned with your friends, neighbors, and colleagues. The result will be enlightening for everybody.

The previous sections of this book described what linked data is and why it is important. The balance of the book describes more of the how’s of linked data. For example, there is a glossary to help reinforce your knowledge of the jargon. You can learn about HTTP “content negotiation” to understand how actionable URIs can return HTML or RDF depending on the way you instruct remote HTTP servers. RDF stands for “Resource Description Framework”, and the “resources” are represented by URIs. A later section of the book describes ways to design the URIs of your resources. Learn how you can transform existing metadata records like MARC or EAD into RDF/XML, and then learn how to put the RDF/XML on the Web. Learn how to exploit your existing databases (such as the ones under Archon, Archivist’s Toolkit, or ArchivesSpace) to generate RDF. If you are the Do It Yourself type, then play with and explore the guidebook’s tools section. Get the gentlest of introductions to searching RDF using a query language called SPARQL. Learn how to read and evaluate ontologies & vocabularies. They are manifested as XML files, and they are easily readable and visualizable using a number of programs. Read about and explore applications using RDF as the underlying data model. There are a growing number of them. The book includes a complete publishing system written in Perl, and if you approach the code of the publishing system as if it were a theatrical play, then the “scripts” read like scenes. (Think of the scripts as if they were a type of poetry, and they will come to life. Most of the “scenes” are less than a page long. The poetry even includes a number of refrains. Think of the publishing system as if it were a one-act play.) If you want to read more, and you desire a vetted list of books and articles, then a later section lists a set of further reading.

After you have spent some time learning a bit more about linked data, discuss what you have learned with your colleagues. There are many different aspects of linked data publishing, such as but not limited to:

  • allocating time and money
  • analyzing your own RDF as well as that of others
  • articulating policies
  • cleaning and improving RDF
  • collecting and harvesting the RDF of others
  • deciding what ontologies & vocabularies to use
  • designing local URIs
  • enhancing RDF triple stores by asserting additional relationships
  • finding and identifying URIs for the purposes of linking
  • making RDF available on the Web (SPARQL, RDFa, data dumps, etc.)
  • project management
  • provisioning value-added services against RDF (catalogs, finding aids, etc.)
  • storing RDF in triple stores

In archival practice, each of these things would be done by different sets of people: archivists & content specialists, administrators & managers, computer programmers & systems administrators, metadata experts & catalogers. Each of these sets of people has a piece of the publishing puzzle and something significant to contribute to the work. Read about linked data. Learn about linked data. Bring these sets of people together to discuss what you have learned. At the very least you will have a better collective understanding of the possibilities. If you don’t plan to “go to Rome” right away, you might decide to reconsider the “vacation” at another time.

Even Michelangelo, when he painted the Sistine Chapel, worked with a team of people, each possessing a complementary set of skills. Each had something different to offer, and the discussion among themselves was key to their success.

by Eric Lease Morgan at April 17, 2014 01:33 AM

April 15, 2014

LiAM: Linked Archival Metadata

Four “itineraries” for putting linked data into practice for the archivist

If you go to Rome for a day, then walk to the Colosseum and Vatican City. Everything you see along the way will be extra. If you go to Rome for a few days, do everything you would do in a single day, eat and drink in a few cafes, see a few fountains, and go to a museum of your choice. For a week, do everything you would do in a few days, and make one or two day-trips outside Rome in order to get a flavor of the wider community. If you can afford two weeks, then do everything you would do in a week, and in addition befriend somebody in the hopes of establishing a life-long relationship.

When you read a guidebook on Rome — or any travel guidebook — there are simply too many listed things to see & do. Nobody can see all the sites, visit all the museums, walk all the tours, nor eat at all the restaurants. It is literally impossible to experience everything a place like Rome has to offer. So it is with linked data. Despite this fact, if you were to do everything linked data has to offer, then you would do all of the things on the following list, starting at the first item, going all the way down to evaluation, and repeating the process over and over:

  • design the structure of your URIs
  • select/design your ontology & vocabularies — model your data
  • map and/or migrate your existing data to RDF
  • publish your RDF as linked data
  • create a linked data application
  • harvest other people’s data and create another application
  • evaluate
  • repeat

Given that it is quite possible you do not plan to immediately dive head-first into linked data, you might begin by getting your feet wet or dabbling in a bit of experimentation. That being the case, here are a number of different “itineraries” for linked data implementation. Think of them as strategies. They are ordered from least costly and most modest to most expensive and most complete in execution:

  1. Rome in a day – Maybe you can’t afford to do anything right now, but if you have gotten this far in the guidebook, then you know something about linked data. Discuss (evaluate) linked data with your colleagues, and consider revisiting the topic in a year.
  2. Rome in three days – If you want something relatively quick and easy, but with the understanding that your implementation will not be complete, begin migrating your existing data to RDF. Use XSLT to transform your MARC or EAD files into RDF serializations, and publish them on the Web. Use something like OAI2RDF to make your OAI repositories (if you have them) available as linked data. Use something like D2RQ to make your archival description stored in databases accessible as linked data. Create a triple store and implement a SPARQL endpoint. As before, discuss linked data with your colleagues.
  3. Rome in a week – Begin publishing RDF, but at the same time think hard about and document the structure of your future RDF’s URIs as well as the ontologies & vocabularies you are going to use. Discuss it with your colleagues. Migrate and re-publish your existing data as RDF using the documentation as a guide. Re-implement your SPARQL endpoint. Discuss linked data not only with your colleagues but with people outside archival practice.
  4. Rome in two weeks – First, do everything you would do in one week. Second, supplement your triple store with the RDF of others. Third, write an application against the triple store that goes beyond search. In short, tell stories, and you will be discussing linked data with the world, literally.

by Eric Lease Morgan at April 15, 2014 02:36 AM

April 14, 2014

LiAM: Linked Archival Metadata

Italian Lectures on Semantic Web and Linked Data


Koha Gruppo Italiano has organized the following free event that may be of interest to linked data aficionados in cultural heritage institutions:

Italian Lectures on Semantic Web and Linked Data: Practical Examples for Libraries, Wednesday May 7, 2014 at The American University of Rome – Auriana Auditorium (Via Pietro Roselli, 16 – Rome, Italy)

  • 9.00 – Welcome
    • Andrew Thompson (Executive Vice President and Provost AUR)
    • Juan Diego Ramírez (Direttore Biblioteca Pontificia Università della Santa Croce)
  • 9.15 – “So many opportunities! Which ones to choose?”, Eric Lease Morgan (University of Notre Dame)
  • 10.00 – “SKOS, Nuovo Soggettario e Wikidata: appunti per l’evoluzione dei sistemi di gestione dell’informazione bibliografica”, Giovanni Bergamin (Biblioteca Nazionale di Firenze)
  • 10.30 – “Open, Big, and Linked Data”, Stefano Bargioni (Biblioteca Pontificia Università della Santa Croce)
  • 11.00 – “La digitalizzazione di materiale archivistico e bibliotecario: un ulteriore elemento per valorizzare gli open data”, Bucap Spa
  • 11.15 – Coffee break
  • 11.45 – “xDams RELOADed: Cultural Heritage to the Web of Data”, Silvia Mazzini (Regesta.exe)
  • 12.00 – Discussion Panel: “L’avvento dei linked data e la fine del MARC”
    • Federico Meschini, moderator (Università della Tuscia)
    • Lucia Panciera (Camera dei Deputati)
    • Fabio Di Giammarco (Biblioteca di Storia moderna e contemporanea)
    • Michele Missikoff e Marco Fratoddi (Stati Generali dell’Innovazione)
  • 13.00 – Close of proceedings

Please RSVP to f.wallner at aur.edu by May 5.

This event is generously sponsored by regesta.exe, Bucap Document Imaging SpA, and SOS Archivi e Biblioteche.


by Eric Lease Morgan at April 14, 2014 02:04 PM

April 12, 2014

LiAM: Linked Archival Metadata

Linked Archival Metadata: A Guidebook

A new but still “pre-published” version of the Linked Archival Metadata: A Guidebook is available. From the introduction:

The purpose of this guidebook is to describe in detail what linked data is, why it is important, how you can publish it to the Web, how you can take advantage of your linked data, and how you can exploit the linked data of others. For the archivist, linked data is about universally making accessible and repurposing sets of facts about you and your collections. As you publish these facts, you will be able to maintain a more flexible Web presence as well as a Web presence that is richer, more complete, and better integrated with complementary collections.

And from the table of contents:

  • Executive Summary
  • Introduction
  • Linked data: A Primer
  • Getting Started: Strategies and Steps
  • Projects
  • Tools and Visualizations
  • Directories of ontologies
  • Content-negotiation and cURL
  • SPARQL tutorial
  • Glossary
  • Further reading
  • Scripts
  • A question from a library school student
  • Out takes

There are a number of versions available.

Feedback desired and hoped for.

by Eric Lease Morgan at April 12, 2014 12:41 PM

April 08, 2014

Life of a Librarian

The 3D Printing Working Group is maturing, complete with a shiny new mailing list

A couple of weeks ago Kevin Phaup took the lead in facilitating a 3D printing workshop here in the Libraries’ Center For Digital Scholarship. More than a dozen students from across the University participated. Kevin presented them with an overview of 3D printing, pointed them towards an online 3D image editing application (Shapeshifter), and everybody created various objects which Matt Sisk has been diligently printing. The event was deemed a success, and there will probably be more specialized workshops scheduled for the Fall.

Since the last blog posting there has also been another Working Group meeting. A short dozen of us got together in Stinson-Remick where we discussed the future possibilities for the Group. The consensus was to create a more formal mailing list, maybe create a directory of people with 3D printing interests, and see about doing something more substantial — with a purpose — for the University.

To those ends, a mailing list has been created. Its name is 3D Printing Working Group. The list is open to anybody, and its purpose is to facilitate discussion of all things 3D printing around Notre Dame and the region. To subscribe, address an email message to listserv@listserv.nd.edu, and in the body of the message include the following command:

subscribe nd-3d-printing Your Name

where Your Name is… your name.

Finally, the next meeting of the Working Group has been scheduled for Wednesday, May 14. It will be sponsored by Bob Sutton of Springboard Technologies, it will be located in Innovation Park across from the University, and it will take place from 11:30 to 1 o’clock. I’m pretty sure lunch will be provided. The purpose of the meeting will be to continue outlining the future directions of the Group as well as to see a demonstration of a printer called the Isis3D.

by Eric Lease Morgan at April 08, 2014 07:28 PM

April 04, 2014

LiAM: Linked Archival Metadata

What is linked data and why should I care?

“Tell me about Rome. Why should I go there?”

Linked data is a standardized process for sharing and using information on the World Wide Web. Since the process of linked data is woven into the very fabric of the way the Web operates, it will be applicable as long as the Web is applicable. The process of linked data is domain agnostic, meaning its scope is equally apropos to archives, businesses, governments, etc. Everybody can participate, and everybody is equally invited to do so. Linked data is application independent. As long as your computer is on the Internet and knows about the World Wide Web, it can take advantage of linked data.

Linked data is about sharing and using information (not mere data but data put into context). This information takes the form of simple “sentences” which are intended to be literally linked together to communicate knowledge. The form of linked data is similar to the forms of human language, and like human languages, linked data is expressive, nuanced, dynamic, and exact all at once. Because of its atomistic nature, linked data simultaneously simplifies and transcends previous information containers. It reduces the need for profession-specific data structures, but at the same time it does not negate their utility. This makes it easy for you to give your information away, and for you to use other people’s information.
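For example, a few of these “sentences” about a made-up archival collection might look like the following Python sketch using rdflib. Each statement is a subject, a predicate, and an object; the URIs (all placeholders here) are what allow the statements to be linked with statements made elsewhere.

# Three linked data "sentences" about a made-up archival collection.
# Subjects and predicates are URIs; objects are URIs or string literals.
# All of the URIs below are placeholders, not real identifiers.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

collection = URIRef("http://archive.example.org/collection/42")

graph = Graph()
graph.add((collection, DC.title, Literal("Papers of a Fictitious Family")))
graph.add((collection, DC.creator, URIRef("http://viaf.org/viaf/0000000000")))
graph.add((collection, DC.subject, URIRef("http://id.loc.gov/authorities/subjects/sh00000000")))

print(graph.serialize(format="turtle"))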

The benefits of linked data boil down to two things: 1) it makes information more accessible to people as well as computers, and 2) it opens the doors to any number of knowledge services limited only by the power of human imagination. Because it is standardized, domain agnostic, application independent, and mimics human expression, linked data is more universal than the current processes of information dissemination. Universality implies decentralization, and decentralization promotes dissemination. On the Internet anybody can say anything at anytime. In the aggregate, this is a good thing, and it enables information to be combined in ways yet to be imagined. Publishing information as linked data enables you to seamlessly enhance your own knowledge services as well as simultaneously enhance the knowledge of others.

“Rome is the Eternal City. After visiting Rome you will be better equipped to participate in the global conversation of the human condition.”

by Eric Lease Morgan at April 04, 2014 08:51 PM

Impressed with ReLoad

I’m impressed with the linked data project called ReLoad. Their data is robust, complete, and full of URIs as well as human-readable labels. From the project’s home page:

The ReLoad project (Repository for Linked open archival data) will foster experimentation with the technology and methods of linked open data for archival resources. Its goal is the creation of a web of linked archival data.
LOD-LAM, which is an acronym for Linked Open Data for Libraries, Archives and Museums, is an umbrella term for the community and active projects in this area.

The first experimental phase will make use of W3C semantic web standards, mash-up techniques, software for linking and for defining the semantics of the data in the selected databases.

The archives that have made portions of their institutions’ data and databases openly available for this project are the Central State Archive, and the Cultural Heritage Institute of Emilia Romagna Region. These will be used to test methodologies to expose the resources as linked open data.

For example, try these links:

Their data is rich enough that tools like LodLive can visualize its resources well.

by Eric Lease Morgan at April 04, 2014 01:56 PM

April 03, 2014

Life of a Librarian

Digital humanities and libraries

This posting outlines a current trend in some academic libraries, specifically, the inclusion of digital humanities into their service offerings. It provides the briefest of introductions to the digital humanities, and then describes how one branch of the digital humanities — text mining — is being put into practice here in the Hesburgh Libraries’ Center For Digital Scholarship at the University of Notre Dame.

(This posting and its companion one-page handout was written for the Information Organization Research Group, School of Information Studies at the University of Wisconsin Milwaukee, in preparation for a presentation dated April 10, 2014.)

Digital humanities

For all intents and purposes, the digital humanities is a newer rather than older scholarly endeavor. A priest named Father Busa is considered the “Father of the Digital Humanities”; in 1965 he worked with IBM to evaluate the writings of Thomas Aquinas. With the advent of the Internet, ubiquitous desktop computing, an increased volume of digitized content, and sophisticated markup languages like TEI (the Text Encoding Initiative), the processes of digital humanities work have moved away from fad and towards trend. While digital humanities work is sometimes called a discipline, this author sees it as more akin to a method. It is a process of doing “distant reading” to evaluate human expression. (The phrase “distant reading” is attributed to Franco Moretti, who coined it in a book entitled Graphs, Maps, Trees: Abstract Models for a Literary History. Distant reading is complementary to “close reading”, and is used to denote the idea of observing many documents simultaneously.) The digital humanities community has grown significantly in the past ten or fifteen years, complete with international academic conferences, graduate school programs, and scholarly publications.

Digital humanities work is a practice where the digitized content of the humanist is quantitatively analyzed as if it were the content studied by a scientist. This sort of analysis can be done against any sort of human expression: written and spoken words, music, images, dance, sculpture, etc. Invariably, the process begins with counting and tabulating. This leads to measurement, which in turn provides opportunities for comparison. From here patterns can be observed and anomalies perceived. Finally, predictions, theses, and judgements can be articulated. Digital humanities work does not replace the more traditional ways of experiencing expressions of the human condition. Instead it supplements the experience.

This author often compares the methods of the digital humanist to the reading of a thermometer. Suppose you observe an outdoor thermometer and it reads 32° (Fahrenheit). This reading, in and of itself, carries little meaning. It is only a measurement. In order to make sense of the reading it is important to put it into context. What is the weather outside? What time of year is it? What time of day is it? How does the reading compare to other readings? If you live in the Northern Hemisphere and the month is July, then the reading is probably an anomaly. On the other hand, if the month is January, then the reading is perfectly normal and not out of the ordinary. The processes of the digital humanist make it possible to make many measurements from a very large body of materials in order to evaluate things like texts, sounds, images, etc. They make it possible to evaluate the totality of Victorian literature, the use of color in paintings over time, or the rhythmic similarities & differences between various forms of music.

Digital humanities centers in libraries

As the more traditional services of academic libraries become more accessible via the Internet, libraries have found the need to evolve. One manifestation of this evolution is the establishment of digital humanities centers. Probably one of the oldest of these centers is located at the University of Virginia, but they now exist in many libraries across the country. These centers provide a myriad of services including combinations of digitization, markup, website creation, textual analysis, speaker series, etc. Sometimes these centers are akin to computing labs. Sometimes they are more like small but campus-wide departments staffed with scholars, researchers, and graduate students.

The Hesburgh Libraries’ Center For Digital Scholarship at the University of Notre Dame was recently established in this vein. The Center supports services around geographic information systems (GIS), data management, statistical analysis of data, and text mining. It is located in a 5,000 square foot space on the Libraries’ first floor and includes a myriad of computers, scanners, printers, a 3D printer, and collaborative work spaces. Below is an annotated list of text mining projects the author has spent time on in the context of the Center. It is intended to give the reader a flavor of the types of work done in the Hesburgh Libraries:

  • Great Books – This was almost a tongue-in-cheek investigation to calculate which book was the “greatest” from a set of books called the Great Books of the Western World. The editors of the set defined a great book as one which discussed any one of a number of great ideas both deeply and broadly. These ideas were tabulated and compared across the corpus and then sorted by the resulting calculation. Aristotle’s Politics was determined to be the greatest book and Shakespeare was determined to have written nine of the top ten greatest books when it comes to the idea of love.
  • HathiTrust Research Center – The HathiTrust Research Center is a branch of the HathiTrust. The Center supports a number of algorithms used to do analysis against reader-defined worksets. The Center For Digital Scholarship facilitates workshops on the use of the HathiTrust Research Center as well as a small set of tools for programmatically searching and retrieving items from the HathiTrust.
  • JSTOR Tool – Data For Research (DFR) is a freely available and alternative interface to the bibliographic index called JSTOR. DFR enables the reader to search the entirety of JSTOR through faceted querying. Search results are tabulated, enabling the reader to create charts and graphs illustrating the results. Search results can be downloaded for more detailed investigations. JSTOR Tool is a Web-based application allowing the reader to summarize and do distant reading against these downloaded results.
  • PDF To Text – Text mining almost always requires the content of its investigation to be in the form of plain text, but much of the content used by people is in PDF. PDF To Text is a Web-based tool which extracts the plain text from PDF files and provides a number of services against the result (readability scores, ngram extraction, concordancing, and rudimentary parts-of-speech analysis.)
  • Perceptions of China – This project is in the earliest stages. Prior to visiting China students have identified photographs and written short paragraphs describing, in their minds, what they think of China. After visiting China the process is repeated. The faculty member leading the students on their trips to China wants to look for patterns of perception in the paragraphs.
  • Poverty Tourism – A university senior believes they have identified a trend — the desire to tour poverty-stricken places. They identified as many as forty websites advertising “Come visit our slum”. Working with the Center they programmatically mirrored the content of the remote websites. They programmatically removed all the HTML tags from the mirrors. They then used Voyant Tools as well as various ngram tabulation tools to do distant reading against the corpus. Their investigations demonstrated the preponderant use of the word “you”, and they posit this is because the authors of the websites are trying to get readers to imagine being in a slum.
  • State Trials – In collaboration with a number of other people, transcripts of the State Trials dating between 1650 and 1700 were analyzed. Digital versions of the Trials were obtained, and a number of descriptive analyses were done. The content was indexed and a timeline was created from search results. Ngram extraction was done as well as parts-of-speech analysis. Various types of similarity measures were computed based on named entities and the over-all frequency of words (vectors). A stop word list was created based on additional frequency tabulations. Much of this analysis was visualized using word clouds, line charts, and histograms. This project is an excellent example of how much of digital humanities work is collaborative and requires the skills of many different types of people.
  • Tiny Text Mining Tools – Text mining is rooted in the counting and tabulation of words, and computers are very good at counting and tabulating. To that end a set of tiny text mining tools has been created enabling the Center to perform quick & dirty analysis against one or more items in a corpus. Written in Perl, the tools implement a well-respected relevancy ranking algorithm (term-frequency inverse document frequency, or TFIDF) to support searching and classification, a cosine similarity measure for clustering and “finding more items like this one”, a concordancing (keyword in context) application, and an ngram (phrase) extractor. (A small sketch of one such tool appears after this list.)
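
For the curious, below is a minimal sketch of one of these tools, a keyword-in-context (concordance) routine, written in Perl. The command line arguments are hypothetical, and the Center’s actual tools are more thorough, but the gist is the same.

  #!/usr/bin/perl
  # kwic.pl - a tiny keyword-in-context (concordance) sketch
  # usage: ./kwic.pl <file> <keyword> [<window>]

  use strict;
  use warnings;

  my ( $file, $keyword, $window ) = @ARGV;
  die "Usage: $0 <file> <keyword> [<window>]\n" unless $file and $keyword;
  $window ||= 30;    # characters of context on either side of the keyword

  # slurp the file and normalize the whitespace
  open my $handle, '<', $file or die "Can't open $file: $!";
  my $text = do { local $/; <$handle> };
  close $handle;
  $text =~ s/\s+/ /g;

  # find each occurrence of the keyword and print it in context
  while ( $text =~ /\b(\Q$keyword\E)\b/gi ) {
      my $match = $1;
      my $start = pos( $text ) - length( $match );
      my $begin = $start - $window;
      $begin = 0 if $begin < 0;
      my $left  = substr( $text, $begin, $start - $begin );
      my $right = substr( $text, pos( $text ), $window );
      print "$left [$match] $right\n";
  }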

Summary

starry night
Text mining, and digital humanities work in general, is simply the application of computing techniques to the content of human expression. Their use is similar to Galileo’s use of the magnifying glass. Instead of turning it down to count the number of fibers in a cloth (or to write an email message), it is being turned up to gaze at the stars (or to analyze the human condition). What he found there was not so much truth as new ways to observe. The same is true of text mining and the digital humanities. They are additional ways to “see”.

Links

Here is a short list of links for further reading:

  • ACRL Digital Humanities Interest Group – This is a mailing list whose content includes mostly announcements of interest to librarians doing digital humanities work.
  • asking for it – Written by Bethany Nowviskie, this is a thorough response to the OCLC report, below.
  • dh+lib – A website amalgamating things of interest to digital humanities librarianship (job postings, conference announcements, blog postings, newly established projects, etc.)
  • Digital Humanities and the Library: A Bibliography – Written by Miriam Posner, this is a nice list of print and digital readings on the topic of digital humanities work in libraries.
  • Does Every Research Library Need a Digital Humanities Center? – A recently published, OCLC-sponsored report intended for library directors who are considering the creation of a digital humanities center.
  • THATCamp – While not necessarily library-related, THATCamp is an organization and process for facilitating informal digital humanities workshops, usually in academic settings.

by Eric Lease Morgan at April 03, 2014 03:02 PM

April 02, 2014

Life of a Librarian

Tiny Text Mining Tools

I have posted to GitHub the very beginnings of a Perl library used to support simple and introductory text mining analysis — tiny text mining tools.

Presently the library is implemented in a set of subroutines stored in a single file supporting:

  • simple in-memory indexing and single-term searching
  • relevancy ranking through term-frequency inverse document frequency (TFIDF) for searching and classification
  • cosine similarity for clustering and “finding more items like this one”

I use these subroutines and the associated Perl scripts to do quick & dirty analysis against corpuses of journal articles, books, and websites.
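
For the sake of illustration, below is a minimal sketch of the two central calculations, TFIDF and cosine similarity, written as plain Perl subroutines. The toy corpus and the exact formulas are assumptions of mine, not necessarily what the posted library implements.

  #!/usr/bin/perl
  # tfidf-sketch.pl - toy TFIDF and cosine similarity calculations

  use strict;
  use warnings;

  # three tiny "documents" represented as hashes of term counts
  my %corpus = (
      'doc1' => { 'love' => 3, 'war'   => 1, 'peace' => 2 },
      'doc2' => { 'love' => 1, 'honor' => 2, 'war'   => 4 },
      'doc3' => { 'peace' => 5, 'honor' => 1 },
  );

  # tfidf = ( c / t ) * log( d / f ) where c is the number of times the
  # term appears in a document, t is the total number of terms in that
  # document, d is the number of documents, and f is the number of
  # documents containing the term
  sub tfidf {
      my ( $term, $doc, $corpus ) = @_;
      my $c = $$corpus{ $doc }{ $term } || 0;
      my $t = 0;
      $t += $_ for values %{ $$corpus{ $doc } };
      my $d = scalar keys %$corpus;
      my $f = grep { exists $$corpus{ $_ }{ $term } } keys %$corpus;
      return 0 unless $c and $f;
      return ( $c / $t ) * log( $d / $f );
  }

  # cosine similarity between two documents' TFIDF vectors
  sub cosine {
      my ( $doc_a, $doc_b, $corpus ) = @_;
      my %terms = map { $_ => 1 } ( keys %{ $$corpus{ $doc_a } }, keys %{ $$corpus{ $doc_b } } );
      my ( $dot, $mag_a, $mag_b ) = ( 0, 0, 0 );
      foreach my $term ( keys %terms ) {
          my $x = tfidf( $term, $doc_a, $corpus );
          my $y = tfidf( $term, $doc_b, $corpus );
          $dot   += $x * $y;
          $mag_a += $x ** 2;
          $mag_b += $y ** 2;
      }
      return 0 unless $mag_a and $mag_b;
      return $dot / ( sqrt( $mag_a ) * sqrt( $mag_b ) );
  }

  # which document is most like doc1?
  printf "doc1 vs %s: %0.3f\n", $_, cosine( 'doc1', $_, \%corpus ) for ( 'doc2', 'doc3' );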

I know, I know. It would be better to implement these things as a set of Perl modules, but I’m practicing what I preach. “Give it away even if it is not ready.” The ultimate idea is to package these things into a single distribution, and enable researchers to have them at their fingertips as opposed to a Web-based application.

by Eric Lease Morgan at April 02, 2014 02:57 PM

March 30, 2014

LiAM: Linked Archival Metadata

Three RDF data models for archival collections

Listed and illustrated here are three examples of RDF data models for archival collections. It is interesting to literally see the complexity (or thoroughness, depending on your perspective) of each model.

rubinstein
This one was designed by Aaron Rubinstein. I don’t know whether or not it was ever put into practice.

lohac
This is the model used in the LOCAH project by the Archives Hub.

pad
This final model — OAD — is being implemented in a project called ReLoad.

There are other ontologies of interest to cultural heritage institutions, but these three seem to be the most apropos to archivists.

This work is a part of a yet-to-be published book called the LiAM Guidebook, a text intended for archivists and computer technologists interested in the application of linked data to archival description.

by Eric Lease Morgan at March 30, 2014 06:49 PM

March 28, 2014

LiAM: Linked Archival Metadata

LiAM Guidebook – a new draft

I have made available a new draft of the LiAM Guidebook. Many of the lists of things (tools, projects, vocabulary terms, Semantic Web browsers, etc.) are complete. Once the lists are done I will move back to the narratives. Thanks go to various people I’ve interviewed lately (Gregory Colati, Karen Gracy, Susan Pyzynski, Aaron Rubinstein, Ed Summers, Diane Hillman, Anne Sauer, and Eliot Wilczek) because without them I would not have been able to get this far nor see a path forward.

by Eric Lease Morgan at March 28, 2014 02:44 AM

Linked data projects of interest to archivists (and other cultural heritage personnel)

While the number of linked data websites is smaller than the total number of websites worldwide, it is really not possible to list every linked data project here, but only things that will be presently useful to the archivist and computer technologist working in cultural heritage institutions. And even then the list of sites will not be complete. Instead, listed below are a number of websites of interest today. This list is a part of the yet-to-be published LiAM Guidebook.

Introductions

The following introductions are akin to directories or initial guides filled with pointers to information about RDF especially meaningful to archivists (and other cultural heritage workers).

  • Datahub (http://datahub.io/) – This is a directory of data sets. It includes descriptions of hundreds of data collections. Some of them are linked data sets. Some of them are not.
  • LODLAM (http://lodlam.net/) – LODLAM is an acronym for Linked Open Data in Libraries Archives and Museums. LODLAM.net is a community, both virtual and real, of linked data aficionados in cultural heritage institutions. It, like OpenGLAM, is a good place to discuss linked data in general.
  • OpenGLAM (http://openglam.org) – GLAM is an acronym for Galleries, Libraries, Archives, and Museums. OpenGLAM is a community fostered by the Open Knowledge Foundation and a place to discuss linked data that is “free”. It, like LODLAM, is a good place to discuss linked data in general.
  • semanticweb.org (http://semanticweb.org) – semanticweb.org is a portal for publishing information on research and development related to the topics Semantic Web and Wikis. Includes data.semanticweb.org and data.semanticweb.org/snorql.

Data sets and projects

The data sets and projects range from simple RDF dumps to full-blown discovery systems. In between are some simple browsable lists and raw SPARQL endpoints.

  • 20th Century Press Archives (http://zbw.eu/beta/p20) – This is an archive of digitized newspaper articles which is made accessible not only as HTML but also in a number of other metadata formats such as RDFa, METS/MODS and OAI-ORE. It is a good example of how metadata publishing can be mixed and matched in a single publishing system.
  • AGRIS (http://agris.fao.org/openagris/) – Here you will find a very large collection of bibliographic information from the field of agriculture. It is accessible via quite a number of methods including linked data.
  • D2R Server for the CIA Factbook (http://wifo5-03.informatik.uni-mannheim.de/factbook/) – The content of the World Fact Book distributed as linked data.
  • D2R Server for the Gutenberg Project (http://wifo5-03.informatik.uni-mannheim.de/gutendata/) – This is a data set of Project Gutenberg content — a list of digitized public domain works, mostly books.
  • Dbpedia (http://dbpedia.org/About) – In the simplest terms, this is the content of Wikipedia made accessible as RDF.
  • Getty Vocabularies (http://vocab.getty.edu) – A set of data sets used to “categorize, describe, and index cultural heritage objects and information”.
  • Library of Congress Linked Data Service (http://id.loc.gov/) – A set of data sets used for bibliographic classification: subjects, names, genres, formats, etc.
  • LIBRIS (http://libris.kb.se) – This is the joint catalog of the Swedish academic and research libraries. Search results are presented in HTML, but the URLs pointing to individual items are really actionable URIs resolvable via content negotiation, thus supporting the distribution of bibliographic information as RDF. This initiative is very similar to OpenCat.
  • Linked Archives Hub Test Dataset (http://data.archiveshub.ac.uk) – This data set is RDF generated from a selection of archival finding aids harvested by the Archives Hub in the United Kingdom.
  • Linked Movie Data Base (http://linkedmdb.org/) – A data set of movie information.
  • Linked Open Data at Europeana (http://pro.europeana.eu/datasets) – A growing set of RDF generated from the descriptions of content in Europeana.
  • Linked Open Vocabularies (http://lov.okfn.org/dataset/lov/) – A linked data set of linked data sets.
  • Linking Lives (http://archiveshub.ac.uk/linkinglives/) – While this project has had no working interface, it is a good read on the challenges of presenting linked data to people (as opposed to computers). Its blog site enumerates and discusses issues from provenance to unique identifiers, from data clean up to interface design.
  • LOCAH Project (http://archiveshub.ac.uk/locah/) – This is/was a joint project between Mimas and UKOLN to make Archives Hub data available as structured Linked Data. (All three organizations are located in the United Kingdom.) EAD files were aggregated. Using XSLT, they were transformed into RDF/XML, and the RDF/XML was saved in a triple store. The triple store was then dumped as a file as well as made searchable via a SPARQL endpoint.
  • New York Times (http://data.nytimes.com/) – A list of New York Times subject headings.
  • OCLC Data Sets & Services (http://www.oclc.org/data/) – Here you will find a number of freely available bibliographic data sets and services. Some are available as RDF and linked data. Others are Web services.
  • OpenCat (http://demo.cubicweb.org/opencatfresnes/) – This is a library catalog combining the authority data (available as RDF) provided by the National Library of France with works of a second library (Fresnes Public Library). Item level search results have URIs whose RDF is available via content negotiation. This project is similar to LIBRIS.
  • PELAGIOS (http://pelagios-project.blogspot.com/p/about-pelagios.html) – A data set of ancient places.
  • ReLoad (http://labs.regesta.com/progettoReload/en) – This is a collaboration between the Central State Archive of Italy, the Cultural Heritage Institute of Emilia Romagna Region, and Regesta.exe. It is the aggregation of EAD files from a number of archives which have been transformed into RDF and made available as linked data. Its purpose and intent are very similar to the purpose and intent of the combined LOCAH Project and Linking Lives.
  • VIAF (http://viaf.org/) – This data set functions as a name authority file.
  • World Bank Linked Data (http://worldbank.270a.info/.html) – A data set of World Bank indicators, climate change information, finances, etc.

by Eric Lease Morgan at March 28, 2014 02:22 AM

RDF tools for the archivist

This posting lists various tools for archivists and computer technologists wanting to participate in various aspects of linked data. Here you will find pointers to creating, editing, storing, publishing, and searching linked data. It is a part of the yet-to-be published LiAM Guidebook.

Directories

The sites listed in this section enumerate linked data and RDF tools. They are jumping off places to other sites:

RDF converters, validators, etc.

Use these tools to create RDF:

  • ead2rdf (http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl) – This is the XSLT stylesheet previously used by the Archives Hub in their LOCAH Linked Archives Hub project. It transforms EAD files into RDF/XML. A slightly modified version of this stylesheet was used to create the LiAM “sandbox”.
  • Protégé (http://protege.stanford.edu) – Install this well-respected tool locally or use it as a hosted Web application to create OWL ontologies.
  • RDF2RDF (http://www.l3s.de/~minack/rdf2rdf/) – A handy Java jar file enabling you to convert various versions of serialized RDF into other versions of serialized RDF.
  • Vapour, a Linked Data Validator (http://validator.linkeddata.org/vapour) – Much like the W3C validator, this online tool will validate the RDF at the other end of a URI. Unlike the W3C validator, it echoes the back-and-forth of the content negotiation process. (For a quick, local sanity check before using these validators, see the sketch after this list.)
  • W3C RDF Validation Service (http://www.w3.org/RDF/Validator/) – Enter a URI or paste an RDF/XML document into the text field, and a triple representation of the corresponding data model as well as an optional graphical visualization of the data model will be displayed.
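
Before bothering a remote validator, it can be handy to check locally whether a file even parses as RDF/XML. Below is a minimal sketch using RDF::Redland, the same Perl library used in the scripts printed later in these postings; the file name is whatever file you want to check.

  #!/usr/bin/perl
  # rdf-check.pl - does a file parse as RDF/XML? if so, count its triples
  # usage: ./rdf-check.pl <file>

  use strict;
  use warnings;
  use RDF::Redland;

  my $file = shift or die "Usage: $0 <file>\n";
  die "Error: $file not found.\n" unless -e $file;

  # parse the file and count the statements (triples) it contains
  my $uri    = RDF::Redland::URI->new( "file:$file" );
  my $parser = RDF::Redland::Parser->new( 'rdfxml', 'application/rdf+xml' );
  die "Error: failed to create a parser\n" unless $parser;

  my $stream = $parser->parse_as_stream( $uri, $uri );
  my $count  = 0;
  while ( ! $stream->end ) { $count++; $stream->next }

  print "$file contains $count triples\n";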

Linked data frameworks and publishing systems

Once RDF is created, use these systems to publish it as linked data:

  • 4store (http://4store.org/) – A linked data publishing framework for managing triple stores, querying them locally, querying them via SPARQL, dumping their contents to files, as well as providing support via a number of scripting languages (PHP, Ruby, Python, Java, etc.).
  • Apache Jena (http://jena.apache.org/) – This is a set of tools for creating, maintaining, and publishing linked data, complete with a SPARQL engine, a flexible triple store application, and an inference engine.
  • D2RQ (http://d2rq.org/) – Use this application to provide a linked data front-end to any (well-designed) relational database. It supports SPARQL, content negotiation, and RDF dumps for direct HTTP access or uploading into a triple store.
  • oai2lod (https://github.com/behas/oai2lod) – This is a particular implementation of D2RQ Server. More specifically, this tool is an intermediary between OAI-PMH data providers and a linked data publishing system. Configure oai2lod to point to your OAI-PMH server and it will publish the server’s metadata as linked data.
  • OpenLink Virtuoso Open-Source Edition (https://github.com/openlink/virtuoso-opensource/) – An open source version of OpenLink Virtuoso. Feature-rich and well-documented.
  • OpenLink Virtuoso Universal Server (http://virtuoso.openlinksw.com) – This is a commercial version of OpenLink Virtuoso Open-Source Edition. It seems to be a platform for modeling and accessing data in a wide variety of forms: relational databases, RDF triple stores, etc. Again, feature-rich and well-documented.
  • openRDF (http://www.openrdf.org/) – This is a Java-based framework for implementing linked data publishing including the establishment of a triple store and a SPARQL endpoint.

by Eric Lease Morgan at March 28, 2014 02:11 AM

March 22, 2014

LiAM: Linked Archival Metadata

Semantic Web browsers

This is a small set of Semantic Web browsers. Give them URIs and they allow you to follow and describe the links they include.

  • LOD Browser Switch (http://browse.semanticweb.org) – This is really a gateway to other Semantic Web browsers. Feed it a URI and it will create lists of URLs pointing to Semantic Web interfaces, but many of the URLs (Semantic Web interfaces) do not seem to work. Some of the resulting URLs point to RDF serialization converters.
  • LodLive (http://en.lodlive.it) – This Semantic Web browser allows you to feed it a URI and interactively follow the links associated with it. URIs can come from DBpedia, Freebase, or one of your own.
  • Open Link Data Explorer (http://demo.openlinksw.com/rdfbrowser2/) – The most sophisticated Semantic Web browser in this set. Given a URI it creates various views of the resulting triples associated with it, including lists of all its properties and objects, network graphs, tabular views, and maps (if the data includes geographic points).
  • Quick and Dirty RDF browser (http://graphite.ecs.soton.ac.uk/browser/) – Given the URL pointing to a file of RDF statements, this tool returns all the triples in the file and verbosely lists each of their predicate and object values. Quick and easy. This is good for reading everything about a particular resource. The tool does not seem to support content negotiation. (A small sketch of content negotiation in action appears after this list.)
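
By way of illustration, here is a minimal sketch of the content negotiation these browsers perform behind the scenes: given a URI, ask for RDF/XML instead of HTML and see where the server redirects you. The sketch assumes nothing more than Perl’s LWP; feed it any actionable linked data URI.

  #!/usr/bin/perl
  # negotiate.pl - ask a linked data URI for RDF instead of HTML
  # usage: ./negotiate.pl <uri>

  use strict;
  use warnings;
  use LWP::UserAgent;

  my $uri = shift or die "Usage: $0 <uri>\n";

  # request RDF/XML; a well-behaved server will redirect (303 See Other)
  # to a document describing the thing denoted by the URI
  my $ua = LWP::UserAgent->new;
  my $response = $ua->get( $uri, 'Accept' => 'application/rdf+xml' );
  die 'Error: ' . $response->status_line . "\n" unless $response->is_success;

  # report where we landed and what we got
  print 'Landed at: ', $response->request->uri, "\n";
  print 'Content type: ', $response->content_type, "\n\n";
  print $response->decoded_content;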

If you need some URIs to begin with, then try some of these:

by Eric Lease Morgan at March 22, 2014 08:45 PM

March 18, 2014

Life of a Librarian

University of Notre Dame 3-D Printing Working Group

This is the tiniest of blog postings describing a fledgling community here on campus called the University of Notre Dame 3-D Printing Working Group.

Working group


A few months ago Paul Turner said to me, “There are an increasing number of people across campus who are interested in 3-D printing. With your new space there in the Center, maybe you could host a few of us and we can share experiences.” Since then a few of us have gotten together a few times to discuss problems and solutions when it comes to these relatively new devices. We have discussed things like:

  • what can these things be used for?
  • what are the advantages and disadvantages of different printers?
  • how and when does one charge for services?
  • what might the future of 3-D printing look like?
  • how can this technology be applied to the humanities?

nose

Mike Elwell from Industrial Design hosted one of our meetings. We learned about “fab labs” and “maker spaces”. 3-D printing seems to be the latest of gizmos for prototyping. He gave us a tour of his space, and I was most impressed with the laser cutter. At our most recent meeting Matt Leevy of Biology showed us how he is making models of people’s nasal cavities so doctors can practice before surgery. There I learned about the use of multiple plastics to do printing and how these multiple plastics can be used to make a more diverse set of objects. Because of the growing interest in 3-D printing, the Center will be hosting a beginner’s 3-D printing workshop on March 28 from 1 – 5 o’clock, facilitated by graduate student Kevin Phaup.

With every get together there have been more and more people attending, with faculty and staff from Biology, Industrial Design, the Libraries, Engineering, OIT, and Innovation Park. Our next meeting — organized just as loosely as the previous meetings — is scheduled for Friday, April 4 from noon to 1 o’clock in room 213 Stinson-Remick Hall. (That’s the new engineering building, and I believe it is the Notre Dame Design Deck space.) Everybody is welcome. The more people who attend, the more we can each get accomplished.

‘Hope to see you there!

by Eric Lease Morgan at March 18, 2014 07:49 PM

March 07, 2014

LiAM: Linked Archival Metadata

Semantic Web application

This posting outlines the implementation of a Semantic Web application.

Many people seem to think the ideas behind the Semantic Web (and linked data) are interesting, but many people are also waiting to see some of the benefits before committing resources to the effort. This is what I call the “chicken & egg problem” of linked data.

While I have not created the application outlined below, I think it is more than feasible. It is a sort of inference engine fed with a URI and an integer, both supplied by a person. Its ultimate goal is to find relationships between URIs that were not immediately or readily apparent.* It is a sort of “find more like this one” application. Here’s the algorithm (a sketch of the core harvesting loop appears after the list):

  1. Allow the reader to select an actionable URI of personal interest, ideally a URI from the set of URIs you curate
  2. Submit the URI to an HTTP server or SPARQL endpoint and request RDF as output
  3. Save the output to a local store
  4. For each subject and object URI found in the output, go to Step #2
  5. Go to step #2 n times for each newly harvested URI in the store where n is a reader-defined integer greater than 1; in other words, harvest more and more URIs, predicates, and literals based on the previously harvested URIs
  6. Create a set of human readable services/reports against the content of the store, and think of these services/reports as akin to a type of finding aid, reference material, or museum exhibit of the future. Example services/reports might include:
    • hierarchal lists of all classes and properties – This would be a sort of semantic map. Each item on the map would be clickable allowing the reader to read more and drill down.
    • text mining reports – collect into a single “bag of words” all the literals saved in the store and create: word clouds, alphabetical lists, concordances, bibliographies, directories, gazetteers, tabulations of parts of speech, named entities, sentiment analyses, topic models, etc.
    • maps – use place names and geographic coordinates to implement a geographic information service
    • audio-visual mash-ups – bring together all the media information and create things like slideshows, movies, analyses of colors, shapes, patterns, etc.
    • search interfaces – implement a search interface against the result, SPARQL or otherwise
    • facts – remember, SPARQL queries can return more than just lists. They can return mathematical results such as sums, ratios, standard deviations, etc. They can also return Boolean values helpful in answering yes/no questions. You could have a set of canned fact queries such as: How many ontologies are represented in the store? Is the number of ontologies greater than 3? Are there more than 100 names represented in this set? How many languages are used in the set? Etc.
  7. Allow the reader to identify a new URI of personal interest, specifically one garnered from the reports generated in Step #6.
  8. Go to Step #2, but this time have the inference engine be more selective by having it try to crawl back to your namespace and set of locally curated URIs.
  9. Return to the reader the URIs identified in Step #8, and by consequence, these URIs ought to share some of the same characteristics as the very first URI; you have implemented a “find more like this one” tool. You, as curator of the collection of URIs might have thought the relations between the first URI and set of final URIs was obvious, but those relationships would not necessarily be obvious to the reader, and therefore new knowledge would have been created or brought to light.
  10. If there are no new URIs from Step #7, then go to Step #6 using the newly harvested content.
  11. Done. If a system such as the one above were created, then the reader would quite likely have acquired some new knowledge, and this would be especially true the greater the size of n in Step #5.
  12. Repeat. Optionally, have a computer program repeat the process with every URI in your curated collection, and have the program save the results for your inspection. You may find relationships you did not perceive previously.
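
To make the process more concrete, here is a minimal sketch of the harvesting loop (Steps #2 through #5), written in Perl with nothing more than LWP and a regular expression. A real implementation would parse the RDF properly and save it to a triple store; the regular expression and the in-memory “store” are simplifications of mine.

  #!/usr/bin/perl
  # harvest.pl - a sketch of Steps #2 through #5: recursively harvest RDF
  # usage: ./harvest.pl <uri> <n>

  use strict;
  use warnings;
  use LWP::UserAgent;

  my ( $seed, $n ) = @ARGV;
  die "Usage: $0 <uri> <n>\n" unless $seed and $n;

  my $ua    = LWP::UserAgent->new;
  my %store = ();          # a poor man's local store: URI => RDF/XML
  my @queue = ( $seed );

  # Step #5: harvest n generations deep
  for my $generation ( 1 .. $n ) {

      my @found = ();
      foreach my $uri ( @queue ) {

          # Steps #2 and #3: request RDF and save the output locally
          next if exists $store{ $uri };
          my $response = $ua->get( $uri, 'Accept' => 'application/rdf+xml' );
          next unless $response->is_success;
          my $rdf = $response->decoded_content;
          $store{ $uri } = $rdf;

          # Step #4: naively extract more URIs to follow; a real
          # implementation would parse the RDF with a proper library
          push @found, ( $rdf =~ /rdf:(?:about|resource)="([^"]+)"/g );
      }

      # feed the newly found URIs back into the loop
      @queue = @found;
  }

  # the reports of Step #6 would be generated from the contents of %store
  print scalar( keys %store ), " descriptions harvested\n";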

I believe many people perceive the ideas behind the Semantic Web to be akin to investigations in artificial intelligence. To some degree this is true, and investigations into artificial intelligence seem to come and go in waves. “Expert systems” and “neural networks” were incarnations of artificial intelligence more than twenty years ago. Maybe the Semantic Web is just another in a long wave of forays.

On the other hand, Semantic Web applications do not need to be so sublime. They can be as simple as discovery systems, browsable interfaces, or even word clouds. The ideas behind the Semantic Web and linked data are implementable. It is just a shame that nothing is catching the attention of wider audiences.

* Remember, URIs are identifiers intended to represent real world objects and/or descriptions of real-world objects. URIs are perfect for cultural heritage institutions because cultural heritage institutions maintain both.

by Eric Lease Morgan at March 07, 2014 09:47 PM

February 28, 2014

LiAM: Linked Archival Metadata

SPARQL tutorial

This is the simplest of SPARQL tutorials. The tutorial’s purpose is two-fold: 1) to introduce the reader, through a set of examples, to the syntax of SPARQL queries, and 2) to enable the reader to initially explore any RDF triple store which is exposed as a SPARQL endpoint.

SPARQL (SPARQL protocol and RDF query language) is a set of commands used to search RDF triple stores. It is modeled after SQL (structured query language), the set of commands used to search relational databases. If you are familiar with SQL, then SPARQL will be familiar. If not, then think of SPARQL queries as formalized sentences used to ask a question and get back a list of answers.

Also, remember, RDF is a data structure of triples: 1) subjects, 2) predicates, and 3) objects. The subjects of the triples are always URIs — identifiers of “things”. Predicates are also URIs, but these URIs are intended to denote relationships between subjects and objects. Objects are preferably URIs but they can also be literals (words or numbers). Finally, RDF objects and predicates are defined in human-created ontologies as a set of classes and properties where classes are abstract concepts and properties are characteristics of the concepts.

Try the following steps with just about any SPARQL endpoint:

  1. Get an overview – A good way to begin is to get a list of all the ontological classes in the triple store. In essence, the query says, “Find all the unique objects in the triple store where any subject is a type of object, and sort the result by object.” (A sketch of how to submit such a query appears after this list.)
  2. Learn about the employed ontologies – Ideally, each of the items in the result will be an actionable URI in the form of a “cool URL”. Using your Web browser, you ought to be able to go to the URL and read a thorough description of the given class, but the URLs are not always actionable.
  3. Learn more about the employed ontologies – Using a second query you can create a list of all the properties in the triple store as well as infer some of the characteristics of each class. Unfortunately, this particular query is intense. It may require a long time to process or may not return at all. In English, the query says, “Find all the unique predicates where the RDF triple has any subject, any predicate, or any object, and sort the result by predicate.”
  4. Guess – Steps #2 and #3 are time intensive, and consequently it is sometimes easier to just browse the triple store by selecting one of the “cool URLs” returned in Step #1. You can submit a modified version of Step #1’s query. It says, “Find all the subjects where any RDF subject (URI) is a type of object (class).” Using the LiAM triple store, for example, such a query can be used to find all the things that are EAD finding aids.
  5. Learn about a specific thing – The result of Step #4 ought to be a list of (hopefully actionable) URIs. You can learn everything about one of those URIs with a query that says, “Find all the predicates and objects in the triple store where the RDF triple’s subject is a given value and the predicate and object are of any value, and sort the result by predicate.” In this case, the given value is one of the items returned from Step #4.
  6. Repeat a few times – If the results from Step #5 returned seemingly meaningful and complete information about your selected URI, then repeat Step #5 a few times to get a better feel for some of the “things” in the triple store. If the results were not meaningful, then go to Step #4 to browse another class.
  7. Take these hints – Two more queries are useful here: one that generates a list of ten URIs pointing to things that came from MARC records, and another that returns everything about a specific URI whose data came from a MARC record.
  8. Read the manual – At this point, it is a good idea to go back to Step #2 and read the more formal descriptions of the underlying ontologies.
  9. Browse some more – If the results of Step #3 returned successfully, then browse the objects in the triple store by selecting a predicate of interest. Similar queries can be used to list things like titles, creators, names, and notes.
  10. Read about SPARQL – This was the tiniest of SPARQL tutorials. Using the LiAM data set as an example, it demonstrated how to do some of the simplest queries against an RDF triple store. There is a whole lot more to SPARQL than SELECT, DISTINCT, WHERE, ORDER BY, and LIMIT commands. SPARQL supports a short-hand way of denoting classes and properties called PREFIX. It supports Boolean operations, limiting results based on “regular expressions”, and a few mathematical functions. SPARQL can also be used to do inserts and deletes against the triple store. The next step is to read more about SPARQL. Consider reading the canonical documentation from the W3C, “SPARQL by example”, and the Jena project’s “SPARQL Tutorial”. [1, 2, 3]
  11. Finally, don’t be too intimidated by SPARQL. Yes, it is possible to submit SPARQL queries by hand, but in reality, person-friendly front-ends are expected to be created, making search much easier.
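
As promised in Step #1, here is a minimal sketch of submitting such a query from Perl. The query text mirrors the “find all classes” sample query listed in the sparql.pl posting elsewhere in these postings; the endpoint is whatever SPARQL endpoint you happen to be exploring.

  #!/usr/bin/perl
  # classes.pl - submit the Step #1 query ("list all classes") to a SPARQL endpoint
  # usage: ./classes.pl <endpoint>

  use strict;
  use warnings;
  use LWP::UserAgent;
  use URI;

  my $endpoint = shift or die "Usage: $0 <endpoint>\n";

  # "Find all the unique objects in the triple store where any subject
  # is a type of object, and sort the result by object."
  my $query = 'SELECT DISTINCT ?class WHERE { [] a ?class } ORDER BY ?class';

  # per the SPARQL protocol, the query is sent as a query parameter
  my $uri = URI->new( $endpoint );
  $uri->query_form( query => $query );
  my $response = LWP::UserAgent->new->get( $uri );
  die 'Error: ' . $response->status_line . "\n" unless $response->is_success;

  # the result is usually SPARQL Query Results XML (or JSON); eyeball it
  print $response->decoded_content;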

    by Eric Lease Morgan at February 28, 2014 03:13 AM

    February 17, 2014

    Life of a Librarian

    CrossRef’s Prospect API

    This is the tiniest of blog postings outlining my experiences with a fledgling API called Prospect.

    Prospect is an API being developed by CrossRef. I learned about it through both word-of-mouth as well as a blog posting by Eileen Clancy called “Easy access to data for text mining“. In a nutshell, given a CrossRef DOI via content negotiation, the API will return both the DOI’s bibliographic information as well as URL(s) pointing to the location of full text instances of the article. The purpose of the API is to provide a straight-forward method for acquiring full text content without the need for screen scraping.

    I wrote a simple, almost brain-dead Perl subroutine implementing the API. For a good time, I put the subroutine into action in a CGI script. Enter a simple query, and the script will search CrossRef for full text articles, and return a list of no more than five (5) titles and their associated URLs where you can get them in a number of formats.

    screen shot
    screen shot of CrossRef Prospect API in action

    The API is pretty straight-forward, but the URLs pointing to the full text are stuffed into a “Links” HTTP header, and the value of the header is not as easily parseable as one might desire. Still, this can be put to good use in my slowly growing stock of text mining tools. Get DOI. Feed to one of my tools. Get data. Do analysis.
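
    To make this concrete, below is a minimal sketch of the sort of thing my subroutine does: resolve a DOI via content negotiation and look for the full text links in the response headers. The Accept media type shown is an assumption of mine, so consult the Prospect documentation before relying on it.

      #!/usr/bin/perl
      # prospect.pl - look for full text links in a DOI's HTTP headers
      # usage: ./prospect.pl <doi>

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $doi = shift or die "Usage: $0 <doi>\n";

      # resolve the DOI via content negotiation; the Accept value below is
      # an assumption of mine -- consult the Prospect documentation for the
      # sanctioned media type
      my $ua = LWP::UserAgent->new;
      my $response = $ua->get(
          "http://dx.doi.org/$doi",
          'Accept' => 'application/vnd.crossref.unixsd+xml'
      );
      die 'Error: ' . $response->status_line . "\n" unless $response->is_success;

      # the full text URLs are stuffed into the link header(s); parsing their
      # values is left as an exercise
      print 'Link header: ', ( $response->header( 'Link' ) || '(none)' ), "\n";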

    Fun with HTTP.

    by Eric Lease Morgan at February 17, 2014 06:11 PM

    Analyzing search results using JSTOR’s Data For Research

    Introduction

    Data For Research (DFR) is an alternative interface to JSTOR enabling the reader to download statistical information describing JSTOR search results. For example, using DFR a person can create a graph illustrating when sets of citations were written, create a word cloud illustrating the most frequently used words in a journal article, or classify sets of JSTOR articles according to a set of broad subject headings. More advanced features enable the reader to extract frequently used phrases in a text as well as list statistically significant keywords. JSTOR’s DFR is a powerful tool enabling the reader to look for trends in large sets of articles as well as drill down into the specifics of individual articles. This hands-on workshop leads the student through a set of exercises demonstrating these techniques.

    Faceted searching

    DFR supports an easy-to-use search interface. Enter one or two words into the search box and submit your query. Alternatively you can do some field searching using the advanced search options. The search results are then displayed and sortable by date, relevance, or a citation rank. More importantly, facets are displayed alongside the search results, and searches can be limited by selecting one or more of the facet terms. Limiting by years, language, subjects, and disciplines proves to be the most useful.

    screen shot
    search results screen

    Publication trends over time

    By downloading the number of citations from multiple search results, it is possible to illustrate publication trends over time.

    In the upper right-hand corner of every search result is a “charts view” link. Once selected it will display a line graph illustrating the number of citations fitting your query over time. It also displays a bar chart illustrating the broad subject areas of your search results. Just as importantly, there is a link at the bottom of the page — “Download data for year chart” — allowing you to download a comma-separated (CSV) file of publication counts and years. This file is easily importable into your favorite spreadsheet program and chartable. If you do multiple searches and download multiple CSV files, then you can compare publication trends. For example, the following chart compares the number of times the phrases “Henry Wadsworth Longfellow”, “Henry David Thoreau”, and “Ralph Waldo Emerson” have appeared in the JSTOR literature between 1950 and 2000. From the chart we can see that Emerson was consistently mentioned more often than both Longfellow and Thoreau. It would be interesting to compare the JSTOR results with the results from Google Books Ngram Viewer, which offers a similar service against their collection of digitized books.

    chart view
    chart view screen shot

    publication trends
    publication trends for Emerson, Thoreau, and Longfellow

    Key word analysis

    DFR counts and tabulates frequently used words and statistically significant key words. These tabulations can be used to illustrate characteristics of search results.

    Each search result item comes complete with title, author, citation, subject, and key terms information. The subjects and key terms are computed values — words and phrases determined by frequency and statistical analysis. Each search result item comes with a “More Info” link which returns lists of the item’s most frequently used words, phrases, and keyword terms. Unfortunately, these lists often include stop words like “the”, “of”, “that”, etc. making the results not as meaningful as they could be. Still, these lists are somewhat informative. They allude to the “aboutness” of the selected article.

    Key terms are also facets. You can expand the Key terms facet to get a small word cloud illustrating the frequency of each term across the entire search result. Clicking on one of the key terms limits the search results accordingly. You can also click on the Export button to download a CSV file of key terms and their frequencies. This information can then be fed to any number of applications for creating word clouds. For example, download the CSV file. Use your text editor to open the CSV file, and find/replace the commas with colons. Copy the entire result, and paste it into Wordle’s advanced interface. (A small script automating this step appears after this paragraph.) This process can be done multiple times for different searches, and the results can be compared & contrasted. Word clouds for Longfellow, Thoreau, and Emerson are depicted below, and from the results you can quickly see both similarities and differences between each writer.
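
    For the programmatically inclined, a tiny script can stand in for the find/replace step just described. It assumes a two-column CSV file of key terms and their frequencies (the file name in the usage line is made up), and it outputs the colon-delimited format expected by Wordle’s advanced interface.

      #!/usr/bin/perl
      # csv2wordle.pl - turn a two-column CSV file of key terms and
      # frequencies into the "term:frequency" format expected by Wordle's
      # advanced interface
      # usage: ./csv2wordle.pl keyterms.csv > wordle.txt

      use strict;
      use warnings;

      while ( my $line = <> ) {
          $line =~ s/\s+$//;                   # trim trailing whitespace/newlines
          my ( $term, $frequency ) = split /,/, $line, 2;
          next unless defined $frequency;
          next unless $frequency =~ /^\d+$/;   # skip a header row, if present
          print "$term:$frequency\n";
      }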

    emerson
    Ralph Waldo Emerson key terms

    thoreau
    Henry David Thoreau key terms

    longfellow
    Henry Wadsworth Longfellow key terms

    Downloading complete data sets

    If you create a DFR account, and if you limit your search results to 1,000 items or less, then you can download a data set describing your search results.

    In the upper right-hand corner of the search results screen is a pull-down menu option for submitting data set requests. The resulting screen presents you with options for downloading a number of different types of data (citations, word counts, phrases, and key terms) in two different formats (CSV and XML). The CSV format is inherently easier to use, but the XML format seems to be more complete, especially when it comes to citation information. After submitting your data set request you will have to wait for an email message from DFR because it takes a while (anywhere from a few minutes to a couple of hours) for it to be compiled.

    data set request
    data set request page

    After downloading a data set you can do additional analysis against it. For example, it is possible to create a timeline illustrating when individual articles were written. It would not be too difficult to create word clouds from titles or author names. If you have programming experience, then you might be able to track ideas over time or identify the originator of specific ideas. Concordances — keyword in context search engines — can be implemented. Some of this functionality, but certainly not all, is being slowly implemented in a Web-based application called JSTOR Tool.

    Summary

    As the written word is increasingly manifested in digital form, so grows the ability to evaluate the written word quantifiably. JSTOR’s DFR is one example of how this can be exploited for the purposes of academic research.

    Note

    A .zip file containing some sample data as well as the briefest of instructions on how to use it is linked from this document.

    by Eric Lease Morgan at February 17, 2014 03:58 PM

    Mini-musings

    LiAM source code: Perl poetry

    #!/usr/bin/perl # Liam Guidebook Source Code; Perl poetry, sort of # Eric Lease Morgan <emorgan@nd.edu> # February 16, 2014 # done exit;

    #!/usr/bin/perl # marc2rdf.pl – make MARC records accessible via linked data # Eric Lease Morgan <eric_morgan@infomotions.com> # December 5, 2013 – first cut; # configure use constant ROOT => ‘/disk01/www/html/main/sandbox/liam’; use constant MARC => ROOT . ‘/src/marc/’; use constant DATA => ROOT . ‘/data/’; use constant PAGES => ROOT . ‘/pages/’; use constant MARC2HTML => ROOT . ‘/etc/MARC21slim2HTML.xsl’; use constant MARC2MODS => ROOT . ‘/etc/MARC21slim2MODS3.xsl’; use constant MODS2RDF => ROOT . ‘/etc/mods2rdf.xsl’; use constant MAXINDEX => 100; # require use IO::File; use MARC::Batch; use MARC::File::XML; use strict; use XML::LibXML; use XML::LibXSLT; # initialize my $parser = XML::LibXML->new; my $xslt = XML::LibXSLT->new; # process each record in the MARC directory my @files = glob MARC . “*.marc”; for ( 0 .. $#files ) { # re-initialize my $marc = $files[ $_ ]; my $handle = IO::File->new( $marc ); binmode( STDOUT, ‘:utf8′ ); binmode( $handle, ‘:bytes’ ); my $batch = MARC::Batch->new( ‘USMARC’, $handle ); $batch->warnings_off; $batch->strict_off; my $index = 0; # process each record in the batch while ( my $record = $batch->next ) { # get marcxml my $marcxml = $record->as_xml_record; my $_001 = $record->field( ’001′ )->as_string; $_001 =~ s/_//; $_001 =~ s/ +//; $_001 =~ s/-+//; print ” marc: $marc\n”; print ” identifier: $_001\n”; print ” URI: http://infomotions.com/sandbox/liam/id/$_001\n”; # re-initialize and sanity check my $output = PAGES . “$_001.html”; if ( ! -e $output or -s $output == 0 ) { # transform marcxml into html print ” HTML: $output\n”; my $source = $parser->parse_string( $marcxml ) or warn $!; my $style = $parser->parse_file( MARC2HTML ) or warn $!; my $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!; my $results = $stylesheet->transform( $source ) or warn $!; my $html = $stylesheet->output_string( $results ); &save( $output, $html ); } else { print ” HTML: skipping\n” } # re-initialize and sanity check my $output = DATA . “$_001.rdf”; if ( ! -e $output or -s $output == 0 ) { # transform marcxml into mods my $source = $parser->parse_string( $marcxml ) or warn $!; my $style = $parser->parse_file( MARC2MODS ) or warn $!; my $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!; my $results = $stylesheet->transform( $source ) or warn $!; my $mods = $stylesheet->output_string( $results ); # transform mods into rdf print ” RDF: $output\n”; $source = $parser->parse_string( $mods ) or warn $!; my $style = $parser->parse_file( MODS2RDF ) or warn $!; my $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!; my $results = $stylesheet->transform( $source ) or warn $!; my $rdf = $stylesheet->output_string( $results ); &save( $output, $rdf ); } else { print ” RDF: skipping\n” } # prettify print “\n”; # increment and check $index++; last if ( $index > MAXINDEX ) } } # done exit; sub save { open F, ‘ > ‘ . shift or die $!; binmode( F, ‘:utf8′ ); print F shift; close F; return; }

    #!/usr/bin/perl # ead2rdf.pl – make EAD files accessible via linked data # Eric Lease Morgan <eric_morgan@infomotions.com> # December 6, 2013 – based on marc2linkedata.pl # configure use constant ROOT => ‘/disk01/www/html/main/sandbox/liam’; use constant EAD => ROOT . ‘/src/ead/’; use constant DATA => ROOT . ‘/data/’; use constant PAGES => ROOT . ‘/pages/’; use constant EAD2HTML => ROOT . ‘/etc/ead2html.xsl’; use constant EAD2RDF => ROOT . ‘/etc/ead2rdf.xsl’; use constant SAXON => ‘java -jar /disk01/www/html/main/sandbox/liam/bin/saxon.jar -s:##SOURCE## -xsl:##XSL## -o:##OUTPUT##’; # require use strict; use XML::XPath; use XML::LibXML; use XML::LibXSLT; # initialize my $saxon = ”; my $xsl = ”; my $parser = XML::LibXML->new; my $xslt = XML::LibXSLT->new; # process each record in the EAD directory my @files = glob EAD . “*.xml”; for ( 0 .. $#files ) { # re-initialize my $ead = $files[ $_ ]; print ” EAD: $ead\n”; # get the identifier my $xpath = XML::XPath->new( filename => $ead ); my $identifier = $xpath->findvalue( ‘/ead/eadheader/eadid’ ); $identifier =~ s/[^\w ]//g; print ” identifier: $identifier\n”; print ” URI: http://infomotions.com/sandbox/liam/id/$identifier\n”; # re-initialize and sanity check my $output = PAGES . “$identifier.html”; if ( ! -e $output or -s $output == 0 ) { # transform marcxml into html print ” HTML: $output\n”; my $source = $parser->parse_file( $ead ) or warn $!; my $style = $parser->parse_file( EAD2HTML ) or warn $!; my $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!; my $results = $stylesheet->transform( $source ) or warn $!; my $html = $stylesheet->output_string( $results ); &save( $output, $html ); } else { print ” HTML: skipping\n” } # re-initialize and sanity check my $output = DATA . “$identifier.rdf”; if ( ! -e $output or -s $output == 0 ) { # create saxon command, and save rdf print ” RDF: $output\n”; $saxon = SAXON; $xsl = EAD2RDF; $saxon =~ s/##SOURCE##/$ead/e; $saxon =~ s/##XSL##/$xsl/e; $saxon =~ s/##OUTPUT##/$output/e; system $saxon; } else { print ” RDF: skipping\n” } # prettify print “\n”; } # done exit; sub save { open F, ‘ > ‘ . shift or die $!; binmode( F, ‘:utf8′ ); print F shift; close F; return; }

    #!/usr/bin/perl # store-make.pl – simply initialize an RDF triple store # Eric Lease Morgan <eric_morgan@infomotions.com> # # December 14, 2013 – after wrestling with wilson for most of the day # configure use constant ETC => ‘/disk01/www/html/main/sandbox/liam/etc/’; # require use strict; use RDF::Redland; # sanity check my $db = $ARGV[ 0 ]; if ( ! $db ) { print “Usage: $0 <db>\n”; exit; } # do the work; brain-dead my $etc = ETC; my $store = RDF::Redland::Storage->new( ‘hashes’, $db, “new=’yes’, hash-type=’bdb’, dir=’$etc’” ); die “Unable to create store ($!)” unless $store; my $model = RDF::Redland::Model->new( $store, ” ); die “Unable to create model ($!)” unless $model; # “save” $store = undef; $model = undef; # done exit;

    #!/user/bin/perl # store-add.pl – add items to an RDF triple store # Eric Lease Morgan <eric_morgan@infomotions.com> # # December 14, 2013 – after wrestling with wilson for most of the day # configure use constant ETC => ‘/disk01/www/html/main/sandbox/liam/etc/’; # require use strict; use RDF::Redland; # sanity check #1 – command line arguments my $db = $ARGV[ 0 ]; my $file = $ARGV[ 1 ]; if ( ! $db or ! $file ) { print “Usage: $0 <db> <file>\n”; exit; } # sanity check #2 – store exists die “Error: po2s file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-po2s.db’ ); die “Error: so2p file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-so2p.db’ ); die “Error: sp2o file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-sp2o.db’ ); # open the store my $etc = ETC; my $store = RDF::Redland::Storage->new( ‘hashes’, $db, “new=’no’, hash-type=’bdb’, dir=’$etc’” ); die “Error: Unable to open store ($!)” unless $store; my $model = RDF::Redland::Model->new( $store, ” ); die “Error: Unable to create model ($!)” unless $model; # sanity check #3 – file exists die “Error: $file not found.\n” if ( ! -e $file ); # parse a file and add it to the store my $uri = RDF::Redland::URI->new( “file:$file” ); my $parser = RDF::Redland::Parser->new( ‘rdfxml’, ‘application/rdf+xml’ ); die “Error: Failed to find parser ($!)\n” if ( ! $parser ); my $stream = $parser->parse_as_stream( $uri, $uri ); my $count = 0; while ( ! $stream->end ) { $model->add_statement( $stream->current ); $count++; $stream->next; } # echo the result warn “Namespaces:\n”; my %namespaces = $parser->namespaces_seen; while ( my ( $prefix, $uri ) = each %namespaces ) { warn ” prefix: $prefix\n”; warn ‘ uri: ‘ . $uri->as_string . “\n”; warn “\n”; } warn “Added $count statements\n”; # “save” $store = undef; $model = undef; # done exit; 10.5 store-search.pl – query a triple store # Eric Lease Morgan <eric_morgan@infomotions.com> # December 14, 2013 – after wrestling with wilson for most of the day # configure use constant ETC => ‘/disk01/www/html/main/sandbox/liam/etc/’; my %namespaces = ( “crm” => “http://erlangen-crm.org/current/”, “dc” => “http://purl.org/dc/elements/1.1/”, “dcterms” => “http://purl.org/dc/terms/”, “event” => “http://purl.org/NET/c4dm/event.owl#”, “foaf” => “http://xmlns.com/foaf/0.1/”, “lode” => “http://linkedevents.org/ontology/”, “lvont” => “http://lexvo.org/ontology#”, “modsrdf” => “http://simile.mit.edu/2006/01/ontologies/mods3#”, “ore” => “http://www.openarchives.org/ore/terms/”, “owl” => “http://www.w3.org/2002/07/owl#”, “rdf” => “http://www.w3.org/1999/02/22-rdf-syntax-ns#”, “rdfs” => “http://www.w3.org/2000/01/rdf-schema#”, “role” => “http://simile.mit.edu/2006/01/roles#”, “skos” => “http://www.w3.org/2004/02/skos/core#”, “time” => “http://www.w3.org/2006/time#”, “timeline” => “http://purl.org/NET/c4dm/timeline.owl#”, “wgs84_pos” => “http://www.w3.org/2003/01/geo/wgs84_pos#” ); # require use strict; use RDF::Redland; # sanity check #1 – command line arguments my $db = $ARGV[ 0 ]; my $query = $ARGV[ 1 ]; if ( ! $db or ! $query ) { print “Usage: $0 <db> <query>\n”; exit; } # sanity check #2 – store exists die “Error: po2s file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-po2s.db’ ); die “Error: so2p file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-so2p.db’ ); die “Error: sp2o file not found. Make a store?\n” if ( ! -e ETC . $db . 
‘-sp2o.db’ ); # open the store my $etc = ETC; my $store = RDF::Redland::Storage->new( ‘hashes’, $db, “new=’no’, hash-type=’bdb’, dir=’$etc’” ); die “Error: Unable to open store ($!)” unless $store; my $model = RDF::Redland::Model->new( $store, ” ); die “Error: Unable to create model ($!)” unless $model; # search #my $sparql = RDF::Redland::Query->new( “CONSTRUCT { ?a ?b ?c } WHERE { ?a ?b ?c }”, undef, undef, “sparql” ); my $sparql = RDF::Redland::Query->new( “PREFIX modsrdf: <http://simile.mit.edu/2006/01/ontologies/mods3#>\nSELECT ?a ?b ?c WHERE { ?a modsrdf:$query ?c }”, undef, undef, ‘sparql’ ); my $results = $model->query_execute( $sparql ); print $results->to_string; # done exit;

    #!/usr/bin/perl # store-dump.pl – output the content of store as RDF/XML # Eric Lease Morgan <eric_morgan@infomotions.com> # # December 14, 2013 – after wrestling with wilson for most of the day # configure use constant ETC => ‘/disk01/www/html/main/sandbox/liam/etc/’; # require use strict; use RDF::Redland; # sanity check #1 – command line arguments my $db = $ARGV[ 0 ]; my $uri = $ARGV[ 1 ]; if ( ! $db ) { print “Usage: $0 <db> <uri>\n”; exit; } # sanity check #2 – store exists die “Error: po2s file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-po2s.db’ ); die “Error: so2p file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-so2p.db’ ); die “Error: sp2o file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-sp2o.db’ ); # open the store my $etc = ETC; my $store = RDF::Redland::Storage->new( ‘hashes’, $db, “new=’no’, hash-type=’bdb’, dir=’$etc’” ); die “Error: Unable to open store ($!)” unless $store; my $model = RDF::Redland::Model->new( $store, ” ); die “Error: Unable to create model ($!)” unless $model; # do the work my $serializer = RDF::Redland::Serializer->new; print $serializer->serialize_model_to_string( RDF::Redland::URI->new, $model ); # done exit;

    #!/usr/bin/perl # sparql.pl – a brain-dead, half-baked SPARQL endpoint # Eric Lease Morgan <eric_morgan@infomotions.com> # December 15, 2013 – first investigations # require use CGI; use CGI::Carp qw( fatalsToBrowser ); use RDF::Redland; use strict; # initialize my $cgi = CGI->new; my $query = $cgi->param( ‘query’ ); if ( ! $query ) { print $cgi->header; print &home } else { # open the store for business my $store = RDF::Redland::Storage->new( ‘hashes’, ‘store’, “new=’no’, hash-type=’bdb’, dir=’/disk01/www/html/main/sandbox/liam/etc’” ); my $model = RDF::Redland::Model->new( $store, ” ); # search my $results = $model->query_execute( RDF::Redland::Query->new( $query, undef, undef, ‘sparql’ ) ); # return the results print $cgi->header( -type => ‘application/xml’ ); print $results->to_string; } # done exit; sub home { # create a list namespaces my $namespaces = &namespaces; my $list = ”; foreach my $prefix ( sort keys $namespaces ) { my $uri = $$namespaces{ $prefix }; $list .= $cgi->li( “$prefix – ” . $cgi->a( { href=> $uri, target => ‘_blank’ }, $uri ) ); } $list = $cgi->ol( $list ); # return a home page return <<EOF <html> <head> <title>LiAM SPARQL Endpoint</title> </head> <body style=’margin: 7%’> <h1>LiAM SPARQL Endpoint</h1> <p>This is a brain-dead and half-baked SPARQL endpoint to a subset of LiAM linked data. Enter a query, but there is the disclaimer. Errors will probably happen because of SPARQL syntax errors. Remember, the interface is brain-dead. Your milage <em>will</em> vary.</p> <form method=’GET’ action=’./’> <textarea style=’font-size: large’ rows=’5′ cols=’65′ name=’query’ /> PREFIX hub:<http://data.archiveshub.ac.uk/def/> SELECT ?uri WHERE { ?uri ?o hub:FindingAid } </textarea><br /> <input type=’submit’ value=’Search’ /> </form> <p>Here are a few sample queries:</p> <ul> <li>Find all triples with RDF Schema labels – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=PREFIX+rdf%3A%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0ASELECT+*+WHERE+%7B+%3Fs+rdf%3Alabel+%3Fo+%7D%0D%0A”>PREFIX rdf:<http://www.w3.org/2000/01/rdf-schema#> SELECT * WHERE { ?s rdf:label ?o }</a></code></li> <li>Find all items with MODS subjects – <code><a href=’http://infomotions.com/sandbox/liam/sparql/?query=PREFIX+mods%3A%3Chttp%3A%2F%2Fsimile.mit.edu%2F2006%2F01%2Fontologies%2Fmods3%23%3E%0D%0ASELECT+*+WHERE+%7B+%3Fs+mods%3Asubject+%3Fo+%7D’>PREFIX mods:<http://simile.mit.edu/2006/01/ontologies/mods3#> SELECT * WHERE { ?s mods:subject ?o }</a></code></li> <li>Find every unique predicate – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=SELECT+DISTINCT+%3Fp+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D”>SELECT DISTINCT ?p WHERE { ?s ?p ?o }</a></code></li> <li>Find everything – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D”>SELECT * WHERE { ?s ?p ?o }</a></code></li> <li>Find all classes – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=SELECT+DISTINCT+%3Fclass+WHERE+%7B+%5B%5D+a+%3Fclass+%7D+ORDER+BY+%3Fclass”>SELECT DISTINCT ?class WHERE { [] a ?class } ORDER BY ?class</a></code></li> <li>Find all properties – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=SELECT+DISTINCT+%3Fproperty%0D%0AWHERE+%7B+%5B%5D+%3Fproperty+%5B%5D+%7D%0D%0AORDER+BY+%3Fproperty”>SELECT DISTINCT ?property WHERE { [] ?property [] } ORDER BY ?property</a></code></li> <li>Find URIs of all finding aids – <code><a 
href=”http://infomotions.com/sandbox/liam/sparql/?query=PREFIX+hub%3A%3Chttp%3A%2F%2Fdata.archiveshub.ac.uk%2Fdef%2F%3E+SELECT+%3Furi+WHERE+%7B+%3Furi+%3Fo+hub%3AFindingAid+%7D”>PREFIX hub:<http://data.archiveshub.ac.uk/def/> SELECT ?uri WHERE { ?uri ?o hub:FindingAid }</a></code></li> <li>Find URIs of all MARC records – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=PREFIX+mods%3A%3Chttp%3A%2F%2Fsimile.mit.edu%2F2006%2F01%2Fontologies%2Fmods3%23%3E+SELECT+%3Furi+WHERE+%7B+%3Furi+%3Fo+mods%3ARecord+%7D%0D%0A%0D%0A%0D%0A”>PREFIX mods:<http://simile.mit.edu/2006/01/ontologies/mods3#> SELECT ?uri WHERE { ?uri ?o mods:Record }</a></code></li> <li>Find all URIs of all collections – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=PREFIX+mods%3A%3Chttp%3A%2F%2Fsimile.mit.edu%2F2006%2F01%2Fontologies%2Fmods3%23%3E%0D%0APREFIX+hub%3A%3Chttp%3A%2F%2Fdata.archiveshub.ac.uk%2Fdef%2F%3E%0D%0ASELECT+%3Furi+WHERE+%7B+%7B+%3Furi+%3Fo+hub%3AFindingAid+%7D+UNION+%7B+%3Furi+%3Fo+mods%3ARecord+%7D+%7D%0D%0AORDER+BY+%3Furi%0D%0A”>PREFIX mods:<http://simile.mit.edu/2006/01/ontologies/mods3#> PREFIX hub:<http://data.archiveshub.ac.uk/def/> SELECT ?uri WHERE { { ?uri ?o hub:FindingAid } UNION { ?uri ?o mods:Record } } ORDER BY ?uri</a></code></li> </ul> <p>This is a list of ontologies (namespaces) used in the triple store as predicates:</p> $list <p>For more information about SPARQL, see:</p> <ol> <li><a href=”http://www.w3.org/TR/rdf-sparql-query/” target=”_blank”>SPARQL Query Language for RDF</a> from the W3C</li> <li><a href=”http://en.wikipedia.org/wiki/SPARQL” target=”_blank”>SPARQL</a> from Wikipedia</li> </ol> <p>Source code — <a href=”http://infomotions.com/sandbox/liam/bin/sparql.pl”>sparql.pl</a> — is available online.</p> <hr /> <p> <a href=”mailto:eric_morgan\@infomotions.com”>Eric Lease Morgan <eric_morgan\@infomotions.com></a><br /> January 6, 2014 </p> </body> </html> EOF } sub namespaces { my %namespaces = ( “crm” => “http://erlangen-crm.org/current/”, “dc” => “http://purl.org/dc/elements/1.1/”, “dcterms” => “http://purl.org/dc/terms/”, “event” => “http://purl.org/NET/c4dm/event.owl#”, “foaf” => “http://xmlns.com/foaf/0.1/”, “lode” => “http://linkedevents.org/ontology/”, “lvont” => “http://lexvo.org/ontology#”, “modsrdf” => “http://simile.mit.edu/2006/01/ontologies/mods3#”, “ore” => “http://www.openarchives.org/ore/terms/”, “owl” => “http://www.w3.org/2002/07/owl#”, “rdf” => “http://www.w3.org/1999/02/22-rdf-syntax-ns#”, “rdfs” => “http://www.w3.org/2000/01/rdf-schema#”, “role” => “http://simile.mit.edu/2006/01/roles#”, “skos” => “http://www.w3.org/2004/02/skos/core#”, “time” => “http://www.w3.org/2006/time#”, “timeline” => “http://purl.org/NET/c4dm/timeline.owl#”, “wgs84_pos” => “http://www.w3.org/2003/01/geo/wgs84_pos#” ); return \%namespaces; }

package Apache2::LiAM::Dereference;

# Dereference.pm - Redirect user-agents based on value of URI.
# Eric Lease Morgan <eric_morgan@infomotions.com>
# December 7, 2013 - first investigations; based on Apache2::Alex::Dereference

# configure
use constant PAGES => 'http://infomotions.com/sandbox/liam/pages/';
use constant DATA  => 'http://infomotions.com/sandbox/liam/data/';

# require
use Apache2::Const -compile => qw( OK );
use CGI;
use strict;

# main
sub handler {

    # initialize
    my $r   = shift;
    my $cgi = CGI->new;
    my $id  = substr( $r->uri, length $r->location );

    # wants HTML
    if ( $cgi->Accept( 'text/html' ) ) {

        print $cgi->header( -status => '303 See Other', -Location => PAGES . $id . '.html', -Vary => 'Accept' );

    }

    # give them RDF
    else {

        print $cgi->header( -status => '303 See Other', -Location => DATA . $id . '.rdf', -Vary => 'Accept', 'Content-Type' => 'application/rdf+xml' );

    }

    # done
    return Apache2::Const::OK;

}

1; # return true or die
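For what it is worth, a handler like this only springs to life after Apache (with mod_perl) has been told to route requests to it. The stanza below is a hypothetical httpd.conf fragment, not the project's actual configuration; the Location path is an assumption chosen to resemble the URIs used elsewhere in these postings.

# load the module and attach it to the URI space being dereferenced (hypothetical path)
PerlModule Apache2::LiAM::Dereference
<Location /sandbox/liam/id>
    SetHandler perl-script
    PerlResponseHandler Apache2::LiAM::Dereference
</Location>

With something like this in place, a request for an identifier beneath that location is answered with a 303 redirection to either the HTML page or the RDF file, depending on the Accept header sent by the user-agent.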

    by Eric Lease Morgan at February 17, 2014 04:40 AM

    February 08, 2014

    LiAM: Linked Archival Metadata

    Linked data and archival practice: Or, There is more than one way to skin a cat.

Two recent experiences have taught me that — when creating some sort of information service — linked data will reside and be mixed in with data collected via any number of Internet techniques. Linked data interfaces will coexist with REST-ful interfaces, or even things as rudimentary as FTP. To the archivist, this means linked data is not the be-all and end-all of information publishing. There is no such thing. To the application programmer, this means you will need to have experience with an ever-growing number of Internet protocols. To both it means, “There is more than one way to skin a cat.”

    Semantic Web in Libraries, 2013

Hamburg, Germany

In October of 2013 I had the opportunity to attend the Semantic Web In Libraries conference. [1, 2] It was a three-day event attended by approximately three hundred people who could roughly be divided into two equally sized groups: computer scientists and cultural heritage institution employees. The bulk of the presentations fell into two categories: 1) publishing linked data, and 2) creating information services. The publishers talked about ontologies, human-computer interfaces for data creation/maintenance, and systems exposing RDF to the wider world. The people creating information services were invariably collecting, homogenizing, and adding value to data gathered from a diverse set of information services. These information services were not limited to sets of linked data. They also included services accessible via REST-ful computing techniques, OAI-PMH interfaces, and there were probably a few locally developed file transfers or relational database dumps described as well. These people were creating lists of information services, regularly harvesting content from the services, writing cross-walks, locally storing the content, indexing it, providing services against the result, and sometimes republishing any number of “stories” based on the data. For the second group of people, linked data was certainly not the only game in town.

    GLAM Hack Philly

Philadelphia, United States

In February of 2014 I had the opportunity to attend a hackathon called GLAM Hack Philly. [3] A wide variety of data sets were presented for “hacking” against. Some were TEI files describing Icelandic manuscripts. Some were linked data published by the British Museum. Some were XML describing digitized journals created by a vendor-based application. Some of the data resided in proprietary database applications describing the location of houses in Philadelphia. Some of it had little or no computer-readable structure at all and described plants. Some of it was the wiki mark-up for local municipalities. After the attendees (there were about two dozen of us) learned about each of the data sets, we self-selected and hacked away at projects of our own design. The results fell into roughly three categories: geo-referencing objects, creating searchable/browsable interfaces, and data enhancement. With the exception of the hack repurposing journal content to create new art, the results were pretty typical for cultural heritage institutions. But what fascinated me was the way we hackers selected our data sets. Namely, the more complete and well-structured the data was, the more hackers gravitated towards it. Of all the data sets, the TEI files were the most complete, accurate, and computer-readable. Three or four projects were done against the TEI. (Heck, I even hacked on the TEI files. [4]) The linked data from the British Museum — very well structured but not quite as thorough as the TEI — attracted a large number of hackers who worked together for a common goal. All the other data sets had only one or two people working on them. What is the moral to the story? There are two of them. First, archivists, if you want people to process your data and do “kewl” things against it, then make sure the data is thorough, complete, and computer-readable. Second, computer programmers, you will need to know a variety of data formats. Linked data is not the only game in town.

    Summary

In summary, the technologies described in this Guidebook are not the only way to accomplish the goals of archivists wishing to make their content more accessible. [5] Instead, linked data is just one of many protocols in the toolbox. It is open, standards-based, and simpler rather than more complex. On the other hand, other protocols exist which have a different set of strengths and weaknesses. Computer technologists will need to have a larger rather than smaller knowledge of various Internet tools. For archivists, the core of the problem is still the collection and description of content. This — the what of archival practice — remains constant. It is the how of archival practice — the technology — that changes at a much faster pace.

    Links

    1. SWIB13 – http://swib.org/swib13/
    2. SWIB3 travelogue – http://blogs.nd.edu/emorgan/2013/12/swib13/
    3. hackathon – http://glamhack.com/
    4. my hack – http://dh.crc.nd.edu/sandbox/glamhack/
    5. Guidebook – http://sites.tufts.edu/liam/

    by Eric Lease Morgan at February 08, 2014 09:30 PM

    February 07, 2014

    LiAM: Linked Archival Metadata

    Archival linked data use cases

    What can you do with archival linked data once it is created? Here are three use cases:

1. Do simple publishing – At its very root, linked data is about making your data available for others to harvest and use. While the “killer linked data application” has seemingly not reared its head, this does not mean you ought not make your data available as linked data. You won’t see the benefits immediately, but sooner or later (less than 5 years from now), you will see your content creeping into the search results of Internet indexes, into the work of both computational humanists and scientists, and into the hands of esoteric hackers creating one-off applications. Internet search engines will create “knowledge graphs”, and they will include links to your content. The humanists and scientists will operate on your data similarly. Both will create visualizations illustrating trends. They will both quantifiably analyze your content looking for patterns and anomalies. Both will probably create network diagrams demonstrating the flow and interconnection of knowledge and ideas through time and space. The humanist might do all this in order to bring history to life or demonstrate how one writer influenced another. The scientist might study ways to efficiently store your data, easily move it around the Internet, or connect it with data sets created by their apparatus. The hacker (those are the good guys) will create flashy-looking applications that many will think are weird and useless, but the applications will demonstrate how the technology can be exploited. These applications will inspire others, be here one day and gone the next, and over time, become more useful and sophisticated.

2. Create a union catalog – If you make your data available as linked data, and if you find at least one other archive who is making their data available as linked data, then you can find a third somebody who will combine them into a triple store and implement a rudimentary SPARQL interface against the union. Once this is done a researcher could conceivably search the interface for a URI to see what is in both collections. The absolute imperative key to success for this to work is the judicious inclusion of URIs in both data sets. This scenario becomes even more enticing with the inclusion of two additional things. First, the more collections in the triple store the better; you cannot have too many collections in the store. Second, the scenario will be even more enticing when each archive publishes its data using ontologies similar to everybody else’s. Success does not hinge on similar ontologies, but success is significantly enhanced. Just like the relational databases of today, nobody will be expected to query them using their native query language (SQL or SPARQL). Instead the interfaces will be much more user-friendly. The properties of classes in ontologies will become facets for searching and browsing. Free text as well as fielded searching via drop-down menus will become available. As time goes on and things mature, the output from these interfaces will be increasingly informative, easy-to-read, and computable. This means the output will answer questions, be visually appealing, as well as be available in one or more formats for other computer programs to operate upon. (A minimal sketch of such a union query appears after this list.)

3. Tell a story – You and your hosting institution(s) have something significant to offer. It is not just about you and your archive but also about libraries, museums, the local municipality, etc. As a whole you are a local geographic entity. You represent something significant with a story to tell. Combine your linked data with the linked data of others in your immediate area. The ontologies will be a total hodgepodge, at least at first. Now provide a search engine against the result. Maybe you begin with local libraries or museums. Allow people to search the interface and bring together the content of everybody involved. Do not just provide lists of links in search results, but instead create knowledge graphs. Supplement the output of search results with the linked data from Wikipedia, Flickr, etc. You don’t have to be a purist. In a federated search sort of way, supplement the output with content from other data feeds such as (licensed) bibliographic indexes or content harvested from OAI-PMH repositories. Creating these sorts of things on-the-fly will be challenging. On the other hand, you might implement something that is more iterative and less immediate, but more thorough and curated, if you were to select a topic or theme of interest and do your own searching and storytelling. The result would be something that is at once a Web page, a document designed for printing, or something importable into another computer program.
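Returning to the second use case, below is a minimal sketch of what a “union catalog” might look like in code. It reuses the RDF::Redland calls and the hub:FindingAid query from the SPARQL endpoint described elsewhere in these postings; the two input file names stand in for the RDF published by two different archives and are assumptions.

#!/usr/bin/perl

# union.pl - a minimal sketch of a union catalog; pour the linked data of
# two (hypothetical) archives into one in-memory store and query the whole

use RDF::Redland;
use strict;

# create a temporary, in-memory triple store
my $storage = RDF::Redland::Storage->new( 'hashes', 'union', "new='yes',hash-type='memory'" );
my $model   = RDF::Redland::Model->new( $storage, '' );

# load each archive's published RDF; these file names are assumptions
my $parser = RDF::Redland::Parser->new( 'rdfxml' );
foreach my $file ( 'archive-one.rdf', 'archive-two.rdf' ) {

    my $rdf = do { local $/; open( my $fh, '<', $file ) or die $!; <$fh> };
    $parser->parse_string_into_model( $rdf, RDF::Redland::URI->new( 'http://example.org/' ), $model );

}

# find the URIs of all finding aids, no matter which archive described them
my $sparql  = 'PREFIX hub:<http://data.archiveshub.ac.uk/def/> ' .
              'SELECT ?uri WHERE { ?uri ?o hub:FindingAid } ORDER BY ?uri';
my $results = $model->query_execute( RDF::Redland::Query->new( $sparql, undef, undef, 'sparql' ) );
print $results->to_string;

# done
exit;

The judicious use of shared URIs is what makes the result more than the sum of its parts; identical URIs appearing in both files simply merge into one description.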

    This text is a part of a draft sponsored by LiAM — the Linked Archival Metadata: A Guidebook.

    by Eric Lease Morgan at February 07, 2014 02:43 AM

    February 04, 2014

    LiAM: Linked Archival Metadata

    Beginner’s glossary to linked data

    This is a beginner’s glossary to linked data. It is a part of the yet-to-be published LiAM Guidebook on linked data in archives.

    • API – (see application programmer interface)
    • application programmer interface (API) – an abstracted set of functions and commands used to get output from remote computer applications. These functions and commands are not necessarily tied to any specific programming language and therefore allow programmers to use a programming language of their choice.
    • content negotiation – a process whereby a user-agent and HTTP server mutually decide what data format will be exchanged during an HTTP request. In the world of linked data, content negotiation is very important when URIs are requested by user-agents because content negotiation helps determine whether or not HTML or serialized RDF will be returned.
• extensible markup language (XML) – a standardized data structure made up of a minimum of rules that can be used to represent everything from tiny bits of data to long narrative texts. XML is designed to be read by people as well as computers, but because of this it is often considered verbose and, ironically, difficult to read.
    • HTML – (see hypertext markup language)
    • HTTP – (see hypertext transfer protocol)
• hypertext markup language (HTML) – an XML-like data structure intended to be rendered by user-agents whose output is for people to read. For the most part, HTML is used to mark up text and denote a text’s stylistic characteristics such as headers, paragraphs, and list items. It is also used to mark up the hypertext links (URLs) between documents.
    • hypertext transfer protocol (HTTP) – the formal name for the way the World Wide Web operates. It begins with one computer program (a user-agent) requesting content from another computer program (a server) and getting back a response. Once received, the response is formatted for reading by a person or for processing by a computer program. The shape and content of both the request and the response are what make-up the protocol.
• Javascript object notation (JSON) – like XML, a data structure allowing arbitrarily large sets of values to be associated with an arbitrarily large set of names (variables). JSON was first natively implemented as a part of the Javascript language, but has since become popular in other computer languages.
    • JSON – (see Javascript object notation)
    • linked data – the stuff and technical process for making real the ideas behind the Semantic Web. It begins with the creation of serialized RDF and making the serialization available via HTTP. User agents are then expected to harvest the RDF, combine it with other harvested RDF, and ideally use it to bring to light new or existing relationships between real world objects — people, places, and things — thus creating and enhancing human knowledge.
    • linked open data – a qualification of linked data whereby the information being exchanged is expected to be “free” as in gratis.
    • ontology – a highly structured vocabulary, and in the parlance of linked data, used to denote, describe, and qualify the predicates of RDF triples. Ontologies have been defined for a very wide range of human domains, everything from bibliography (Dublin Core or MODS), to people (FOAF), to sounds (Audio Features).
    • RDF – (see resource description framework)
    • representational state transfer (REST) – a process for querying remote HTTP servers and getting back computer-readable results. The process usually employs denoting name-value pairs in a URL and getting back something like XML or JSON.
    • resource description framework – the conceptual model for describing the knowledge of the Semantic Web. It is rooted in the notion of triples whose subjects and objects are literally linked with other triples through the use of actionable URIs.
    • REST – (see representational state transfer)
• Semantic Web – an idea articulated by Tim Berners-Lee whereby human knowledge is expressed in a computer-readable fashion and made available via HTTP so computers can harvest it and bring to light new information or knowledge.
    • serialization – a manifestation of RDF; one of any number of textual expressions of RDF triples. Examples include but are not limited to RDF/XML, RDFa, N3, and JSON-LD.
    • SPARQL – (see SPARQL protocol and RDF query language)
    • SPARQL protocol and RDF query language (SPARQL) – a formal specification for querying and returning results from RDF triple stores. It looks and operates very much like the structured query language (SQL) of relational databases complete with its SELECT, WHERE, and ORDER BY clauses.
    • triple – the atomistic facts making up RDF. Each fact is akin to a rudimentary sentence with three parts: 1) subject, 2) predicate, and 3) object. Subjects are expected to be URIs. Ideally, objects are URIs as well, but can also be literals (words, phrases, or numbers). Predicates are akin to the verbs in a sentence and they denote a relationship between the subject and object. Predicates are expected to be a member of a formalized ontology.
    • triple store – a database of RDF triples usually accessible via SPARQL
• uniform resource identifier (URI) – a unique pointer to a real-world object or to a description of an object. In the parlance of linked data, URIs are expected to have the same shape and function as URLs, and if they do, then the URIs are often described as “actionable”.
• uniform resource locator (URL) – an address denoting the location of something on the Internet. These addresses usually specify a protocol (like http), a host (or computer) where the protocol is implemented, and a path (directory and file) specifying where on the computer the item of interest resides.
• URI – (see uniform resource identifier)
• URL – (see uniform resource locator)
• user agent – the formal name for a program that makes HTTP requests. Web browsers are user agents whose output is intended for people to read; in the world of linked data, user agents are just as often computer programs (“robots”) that read the responses themselves.
    • XML – (see extensible markup language)

    For a more complete and exhaustive glossary, see the W3C’s Linked Data Glossary.

    by Eric Lease Morgan at February 04, 2014 01:47 AM

    January 31, 2014

    LiAM: Linked Archival Metadata

    RDF serializations

    RDF can be expressed in many different formats, called “serializations”.

    RDF (Resource Description Framework) is a conceptual data model made up of “sentences” called triples — subjects, predicates, and objects. Subjects are expected to be URIs. Objects are expected to be URIs or string literals (think words, phrases, or numbers). Predicates are “verbs” establishing relationships between the subjects and the objects. Each triple is intended to denote a specific fact.

When the idea of the Semantic Web was first articulated, XML was the predominant data structure of the time. It was seen as a way to encapsulate data that was both readable by humans as well as computers. Like any data structure, XML has both its advantages as well as disadvantages. On one hand it is easy to determine whether or not XML files are well-formed, meaning they are syntactically correct. Given a DTD, or better yet, an XML schema, it is also easy to determine whether or not an XML file is valid — meaning whether it contains the necessary XML elements and attributes, and whether they are arranged and used in the agreed-upon manner. XML also lends itself to transformations into other plain text documents through the generic, platform-independent XSLT (Extensible Stylesheet Language Transformation) process. Consequently, RDF was originally manifested — made real and “serialized” — through the use of RDF/XML. The example of RDF at the beginning of the Guidebook was an RDF/XML serialization:

    
    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dcterms="http://purl.org/dc/terms/"
             xmlns:foaf="http://xmlns.com/foaf/0.1/">
      <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
        <dcterms:creator>
          <foaf:Person rdf:about="http://id.loc.gov/authorities/names/n79089957">
            <foaf:gender>male</foaf:gender>
          </foaf:Person>
        </dcterms:creator>
      </rdf:Description>
    </rdf:RDF>

    This RDF can be literally illustrated with the graph, below:

    On the other hand, XML, almost by definition, is verbose. Element names are expected to be human-readable and meaningful, not obtuse nor opaque. The judicious use of special characters (&, <, >, “, and ‘) as well as entities only adds to the difficulty of actually reading XML. Consequently, almost from the very beginning people thought RDF/XML was not the best way to express RDF, and since then a number of other syntaxes — data structures — have manifested themselves.

Below is the same RDF serialized in a format called Notation 3 (N3), which is very human readable while still being structured enough for computer processing. It incorporates a line-based data structure called N-Triples to denote the triples themselves:

    @prefix foaf: <http://xmlns.com/foaf/0.1/>.
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
    @prefix dcterms: <http://purl.org/dc/terms/>.
    <http://en.wikipedia.org/wiki/Declaration_of_Independence> dcterms:creator <http://id.loc.gov/authorities/names/n79089957>.
    <http://id.loc.gov/authorities/names/n79089957> a foaf:Person;
    	foaf:gender "male".

    JSON (JavaScript Object Notation) is a popular data structure inherent to the use of JavaScript and Web browsers, and RDF can be expressed in a JSON format as well:

    {
      "http://en.wikipedia.org/wiki/Declaration_of_Independence": {
        "http://purl.org/dc/terms/creator": [
          {
            "type": "uri", 
            "value": "http://id.loc.gov/authorities/names/n79089957"
          }
        ]
      }, 
      "http://id.loc.gov/authorities/names/n79089957": {
        "http://xmlns.com/foaf/0.1/gender": [
          {
            "type": "literal", 
            "value": "male"
          }
        ], 
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
          {
            "type": "uri", 
            "value": "http://xmlns.com/foaf/0.1/Person"
          }
        ]
      }
    }
    

Just about the newest RDF serialization is an embellishment of JSON called JSON-LD. Compare and contrast the serialization below with the one above:

    {
      "@graph": [
        {
          "@id": "http://en.wikipedia.org/wiki/Declaration_of_Independence",
          "http://purl.org/dc/terms/creator": {
            "@id": "http://id.loc.gov/authorities/names/n79089957"
          }
        },
        {
          "@id": "http://id.loc.gov/authorities/names/n79089957",
          "@type": "http://xmlns.com/foaf/0.1/Person",
          "http://xmlns.com/foaf/0.1/gender": "male"
        }
      ]
    }
    

    RDFa represents a way of expressing RDF embedded in HTML, and here is such an expression:

    <div xmlns="http://www.w3.org/1999/xhtml"
      prefix="
        foaf: http://xmlns.com/foaf/0.1/
        rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
        dcterms: http://purl.org/dc/terms/
        rdfs: http://www.w3.org/2000/01/rdf-schema#"
      >
      <div typeof="rdfs:Resource" about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
        <div rel="dcterms:creator">
          <div typeof="foaf:Person" about="http://id.loc.gov/authorities/names/n79089957">
            <div property="foaf:gender" content="male"></div>
          </div>
        </div>
      </div>
    </div>
    

The purpose of publishing linked data is to make RDF triples easily accessible. This does not necessarily mean the transformation of EAD or MARC into RDF/XML, but rather making accessible the statements of RDF within the context of the reader. In this case, the reader may be a human or some sort of computer program. Each serialization has its own strengths and weaknesses. Ideally the archive would figure out ways to exploit each of these RDF serializations for specific publishing purposes.

    For a good time, play with the RDF Translator which will convert one RDF serialization into another.
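The same sort of conversion can also be done locally with the Perl RDF modules (see perlrdf.org in the tools list). Here is a minimal sketch, assuming RDF::Trine is installed and that the RDF/XML example above has been saved to a file called independence.rdf; both of those particulars are assumptions.

#!/usr/bin/perl

# serialize.pl - a minimal sketch converting one RDF serialization (RDF/XML)
# into another (Turtle); RDF::Trine and the file name are assumptions

use RDF::Trine;
use strict;

# parse the RDF/XML into an in-memory model
my $model = RDF::Trine::Model->temporary_model;
RDF::Trine::Parser->new( 'rdfxml' )->parse_file_into_model( 'http://example.org/', 'independence.rdf', $model );

# write the very same triples back out, this time as Turtle
print RDF::Trine::Serializer->new( 'turtle' )->serialize_model_to_string( $model );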

The RDF serialization process also highlights how data structures are moving away from document-centric models toward statement-centric models. This too has consequences for the way cultural heritage institutions, like archives, think about exposing their metadata, but that is the topic of another essay.

    by Eric Lease Morgan at January 31, 2014 06:07 PM

    January 26, 2014

    LiAM: Linked Archival Metadata

    CURL and content-negotiation

    This is the tiniest introduction to cURL and content-negotiation. It is a part of the to-be-published-in-April Linked Archival Metadata guidebook.

    CURL is a command-line tool making it easier for you to see the Web as data and not presentation. It is a sort of Web browser, but more specifically, it is a thing called a user-agent. Content-negotiation is an essential technique for publishing and making accessible linked data. Please don’t be afraid of the command-line though. Understanding how to use cURL and do content-negotiation by hand will take you a long way in understanding linked data.

    The first step is to download and install cURL. If you have a Macintosh or a Linux computer, then it is probably already installed. If not, then give the cURL download wizard a whirl. We’ll wait.

    Next, you need to open a terminal. On Macintosh computers a terminal application is located in the Utilities folder of your Applications folder. It is called “Terminal”. People using Windows-based computers can find the “Command” application by searching for it in the Start Menu. Once cURL has been installed and a terminal has been opened, then you can type the following command at the prompt to display a help text:

    curl --help

    There are many options there, almost too many. It is often useful to view only one page of text at a time, and you can “pipe” the output through to a program called “more” to do this. By pressing the space bar, you can go forward in the display. By pressing “b” you can go backwards, and by pressing “q” you can quit:

    curl --help | more

    Feed cURL the complete URL of Google’s home page to see how much content actually goes into their “simple” presentation:

    curl http://www.google.com/ | more

    The communication of the World Wide Web (the hypertext transfer protocol or HTTP) is divided into two parts: 1) a header, and 2) a body. By default cURL displays the body content. To see the header, add the -I (for a mnemonic, think “information”) to the command:

    curl -I http://www.google.com/

    The result will be a list of characteristics the remote Web server is using to describe this particular interaction between itself and you. The most important things to note are: 1) the status line and 2) the content type. The status line will be the first line in the result, and it will say something like “HTTP/1.1 200 OK”, meaning there were no errors. Another line will begin with “Content-Type:” and denotes the format of the data being transferred. In most cases the content type line will include something like “text/html” meaning the content being sent is in the form of an HTML document.

    Now feed cURL a URI for Walt Disney, such as one from DBpedia:

    curl http://dbpedia.org/resource/Walt_Disney

The result will be empty, but upon the use of the -I switch you can see how the status line changed to “HTTP/1.1 303 See Other”. This means there is no content at the given URI, and the line starting with “Location:” is a pointer — an instruction — to go to a different document. In the parlance of HTTP this is called redirection. Using cURL to go to the recommended location results in a stream of HTML:

    curl http://dbpedia.org/page/Walt_Disney | more

    Most Web browsers automatically follow HTTP redirection commands, but cURL needs to be told this explicitly through the use of the -L switch. (Think “location”.) Consequently, given the original URI, the following command will display HTML even though the URI has no content:

    curl -L http://dbpedia.org/resource/Walt_Disney | more

    Now remember, the Semantic Web and linked data depend on the exchange of RDF, and upon closer examination you can see there are “link” elements in the resulting HTML, and these elements point to URLs with the .rdf extension. Feed these URLs to cURL to see an RDF representation of the Walt Disney data:

    curl http://dbpedia.org/data/Walt_Disney.rdf | more

    Downloading entire HTML streams, parsing them for link elements containing URLs of RDF, and then requesting the RDF is not nearly as efficient as requesting RDF from the remote server in the first place. This can be done by telling the remote server you accept RDF as a format type. This is accomplished through the use of the -H switch. (Think “header”.) For example, feed cURL the URI for Walt Disney and specify your desire for RDF/XML:

    curl -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Walt_Disney

    Ironically, the result will be empty, and upon examination of the HTTP headers (remember the -I switch) you can see that the RDF is located at a different URL, namely, http://dbpedia.org/data/Walt_Disney.xml:

    curl -I -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Walt_Disney

    Finally, using the -L switch, you can use the URI for Walt Disney to request the RDF directly:

    curl -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Walt_Disney

    That is cURL and content-negotiation in a nutshell. A user-agent submits a URI to a remote HTTP server and specifies the type of content it desires. The HTTP server responds with URLs denoting the location of desired content. The user-agent then makes a more specific request. It is sort of like the movie. “One URI to rule them all.” In summary, remember:

    1. cURL is a command-line user-agent
    2. given a URL, cURL returns, by default, the body of an HTTP transaction
    3. the -I switch allows you to see the HTTP header
    4. the -L switch makes cURL automatically follow HTTP redirection requests
    5. the -H switch allows you to specify the type of content you wish to accept
    6. given a URI and the use of the -L and -H switches you are able to retrieve either HTML or RDF

    Use cURL to actually see linked data in action, and here are a few more URIs to explore:

    • Walt Disney via VIAF – http://viaf.org/viaf/36927108/
    • origami via the Library of Congress – http://id.loc.gov/authorities/subjects/sh85095643
    • Paris from DBpedia – http://dbpedia.org/resource/Paris
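The same negotiation can also be scripted. Below is a minimal Perl sketch using LWP::UserAgent (an assumption; any HTTP client library would do) that mimics the final cURL command above: it asks for RDF/XML, follows the 303 redirection, and prints the result.

#!/usr/bin/perl

# negotiate.pl - a minimal sketch of content negotiation without cURL;
# LWP::UserAgent is an assumption, and redirections are followed automatically

use LWP::UserAgent;
use strict;

my $uri      = 'http://dbpedia.org/resource/Walt_Disney';
my $ua       = LWP::UserAgent->new;
my $response = $ua->get( $uri, 'Accept' => 'application/rdf+xml' );

# print the RDF/XML describing Walt Disney, or complain
die $response->status_line unless $response->is_success;
print $response->decoded_content;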

    by Eric Lease Morgan at January 26, 2014 08:29 PM

    January 22, 2014

    LiAM: Linked Archival Metadata

    Questions from a library science student about RDF and linked data

Yesterday I received the following set of questions from a library school student concerning RDF and linked data. They are excellent questions and the sort I am expected to answer in the LiAM Guidebook. I thought I’d post my reply here. Their identity has been hidden to protect the innocent.

    I’m writing you to ask you about your thoughts on implementing these kinds of RDF descriptions for institutional collections. Have you worked on a project that incorporated LD technologies like these descriptions? What was that experience like? Also, what kind of strategies have you used to implement these strategies, for instance, was considerable buy-in from your organization necessary, or were you able to spearhead it relatively solo? In essence, what would it “cost” to really do this?

    I apologize for the mass of questions, especially over e-mail. My only experience with ontology work has been theoretical, and I haven’t met any professionals in the field yet who have actually used it. When I talk to my mentors about it, they are aware of it from an academic standpoint but are wary of it due these questions of cost and resource allocation, or confusion that it doesn’t provide anything new for users. My final project was to build an ontology to describe some sort of resource and I settled on building a vocabulary to describe digital facsimiles and their physical artifacts, but I have yet to actually implement or use any of it. Nor have I had a chance yet to really use any preexisting vocabularies. So I’ve found myself in a slightly frustrating position where I’ve studied this from an academic perspective and seek to incorporate it in my GLAM work, but I lack the hands-on opportunity to play around with it.


    MLIS Candidate

    Dear MLIS Candidate, no problem, really, but I don’t know how much help I will really be.

The whole RDF / Semantic Web thing started more than ten years ago. The idea was to expose RDF/XML, allow robots to crawl these files, amass the data, and discover new knowledge — relationships — underneath. Many in the library profession thought this was science fiction and/or the sure pathway to professional obsolescence. Needless to say, it didn’t get very far. A few years ago the idea of linked data was articulated, and in a nutshell it outlined how to make various flavors of serialized RDF available via an HTTP technique called content negotiation. This was when things like Turtle, N3, the idea of triple stores, maybe SPARQL, and other things came to fruition. This time the idea of linked data was more real and got a bit more traction, but it is still not mainstream.

    I have very little experience putting the idea of RDF and linked data into practice. A long time ago I created RDF versions of my Alex Catalogue and implemented a content negotiation tool against it. The Catalogue was not a part of any institution other than myself. When I saw the call for the LiAM Guidebook I applied and got the “job” because of my Alex Catalogue experiences as well as some experience with a thing colloquially called The Catholic Portal which contains content from EAD files.

I knew this previously, but linked data is all about URIs and ontologies. Minting URIs is not difficult, but instead of rolling your own, it is better to use the URIs of others, such as the URIs in DBpedia, GeoNames, VIAF, etc. The ontologies used to create relationships between the URIs are difficult to identify, articulate, and/or use consistently. They are manifestations of human language, and human language is ambiguous. Trying to implement the nuances of human language in computer “sentences” called RDF triples is only an approximation at best. I sometimes wonder if the whole thing can really come to fruition. I look at OAI-PMH. It had the same goals, but in the end it was deemed not a success because it was too difficult to implement. The Semantic Web may follow suit.

That said, it is not too difficult to make the metadata of just about any library or archive available as linked data. The technology is inexpensive and already there. The implementation will not necessarily embody best practices, but it will not expose incorrect nor invalid data, just data that is not the best. Assuming the library has MARC or EAD files, it is possible to use XSL to transform the metadata into RDF/XML. HTML and RDF/XML versions of the metadata can then be saved on an HTTP file system and disseminated either to humans or robots through content negotiation. Once a library or archive gets this far, they can either improve their MARC or EAD files to include more URIs, improve their XSLT to take better advantage of shared ontologies, and/or dump MARC and EAD altogether and learn to expose linked data directly from (relational) databases. It is an iterative process which is never completed.
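As an illustration of the transformation step, here is a minimal sketch. It assumes the XML::LibXSLT module is installed, that a finding aid has been saved as finding-aid.xml, and that the Archives Hub ead2rdf.xsl stylesheet (listed in the tools section of the Guidebook) has been downloaded locally; all three of those names are assumptions.

#!/usr/bin/perl

# transform.pl - a minimal sketch: cross-walk an EAD file into RDF/XML with XSLT;
# the file names are assumptions, and ead2rdf.xsl is the Archives Hub stylesheet

use XML::LibXML;
use XML::LibXSLT;
use strict;

# slurp the finding aid and the stylesheet, then compile the stylesheet
my $source     = XML::LibXML->load_xml( location => 'finding-aid.xml' );
my $style      = XML::LibXML->load_xml( location => 'ead2rdf.xsl' );
my $stylesheet = XML::LibXSLT->new->parse_stylesheet( $style );

# do the work and send the resulting RDF/XML to STDOUT for saving
print $stylesheet->output_as_bytes( $stylesheet->transform( $source ) );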

Nothing new to users? Ah, that is the rub and a sticking point with the linked data / Semantic Web thing. It is a sort of chicken & egg problem. “Show me the cool application that can be created if I expose my metadata as linked data”, say some people. On the other hand, “I can not create the cool application until there is a critical mass of available content.” Despite this issue, things are happening for readers, namely mash-ups. (I don’t like the word “users”.) Do a search in Facebook for Athens. You will get a cool looking Web page describing Athens, who has been there, etc. This was created by assembling metadata from a host of different places (all puns intended), and some of those places were linked data repositories. Do a search in Google for the same thing. Instead of just bringing back a list of links, Google presents you with real content, again, amassed through various APIs including linked data. Visit VIAF and search for a well-known author. Navigate the result and you will maybe end up at WorldCat Identities, where all sorts of interesting information about an author — who they wrote with, what they wrote, and where — is displayed. All of this is rooted in linked data and Web Services computing techniques. This is where the benefit comes. Library and archival metadata will become part of these mash-ups — called “named graphs” — driving readers to library and archival websites. Linked data can become part of Wikipedia. It can be used to enrich descriptions of people in authority lists, gazetteers, etc.

What is the cost? Good question. Time is the biggest expense. If a person knows what they are doing, then a robust set of linked data could be exposed in less than a month, but in order to get that far many people need to contribute. Systems types will be needed to get the data out of content management systems as well as set up HTTP servers. Programmers will be needed to do the transformations. Catalogers will be needed to assist in the interpretation of AACR2 cataloging practices, etc. It will take a village to do the work, and that doesn’t even count convincing people this is a good idea.

    Your frustration is not uncommon. Often times if there is not a real world problem to solve, learning anything new when it comes to computers is difficult. I took BASIC computer programming three times before anything sunk in, and it only sunk in when I needed to calculate how much money I was earning as a taxi driver.

linked data sets as of 2011


Try implementing one of your passions. Do you collect anything? Baseball cards? Flowers? Books? Records? Music? Art? Is there something in your employer’s special collections of interest to you? Find something of interest to you. For simplicity’s sake, use a database to describe each item in the collection, making sure each record in your database includes a unique key field. Identify one or more ontologies (others’ as well as your own) whose properties closely match the names of the fields in your database. Write a program against your database to create static HTML pages. Put the pages on the Web. Write a program against your database to create static RDF/XML (or some other RDF serialization). Put the pages on the Web. Implement a content negotiation script that takes your database keys as input and redirects HTTP user agents to either the HTML or RDF. Submit the root of your linked data implementation to Datahub.io. Ta da! You have successfully implemented linked data and learned a whole lot along the way. Once you get that far you can take what you have learned and apply it in a bigger and better way for a larger set of data.
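To make the exercise above a bit more concrete, here is a minimal sketch of the “program against your database to create static RDF/XML” step. Everything in it is an assumption: an SQLite database called collection.db, a table called items with id, title, and creator columns, and a made-up URI pattern. Real code would also XML-escape the values.

#!/usr/bin/perl

# rdfize.pl - a minimal sketch: read a (hypothetical) database of collection
# items and write one static RDF/XML file per record; all names are assumptions

use DBI;
use strict;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=collection.db', '', '', { RaiseError => 1 } );
my $sth = $dbh->prepare( 'SELECT id, title, creator FROM items' );
$sth->execute;

while ( my $item = $sth->fetchrow_hashref ) {

    # mint a URI and a file name from the record's unique key
    my $uri = "http://example.org/collection/id/$item->{id}";
    open my $out, '>', "data/$item->{id}.rdf" or die $!;

    # write the simplest possible description; real code would escape and validate
    print $out <<"EOF";
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="$uri">
    <dcterms:title>$item->{title}</dcterms:title>
    <dcterms:creator>$item->{creator}</dcterms:creator>
  </rdf:Description>
</rdf:RDF>
EOF

    close $out;

}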

On one hand the process is not difficult. It is a matter of repurposing the already existing skills of people who work in cultural heritage institutions. On the other hand, change in the way things are done is difficult (more difficult than change in what is done), and such change is hard to balance against existing priorities. Exposing library and archival content as linked data represents a different working style, but the end result is the same — making the content of our collections available for use and understanding.

    HTH.

    by Eric Lease Morgan at January 22, 2014 02:59 AM

    January 21, 2014

    Life of a Librarian

    Paper Machines

    Today I learned about Paper Machines, a very useful plug-in for Zotero allowing the reader to do visualizations against their collection of citations.

    Today Jeffrey Bain-Conkin pointed me towards a website called Lincoln Logarithms where sermons about Abraham Lincoln and slavery were analyzed. To do some of the analysis a Zotero plug-in called Paper Machines was used, and it works pretty well:

    1. make sure Zotero’s full text indexing feature is turned on
    2. download and install Paper Machines
    3. select one of your Zotero collections to be analyzed
    4. wait
    5. select any one of a number of visualizations to create

I am in the process of writing a book on linked data for archivists. I am using Zotero to keep track of the book’s citations, etc. I used Paper Machines to generate the following images:

• a word cloud where the words are weighted by a TF-IDF score
• a “phrase net” where the words are joined by the word “is”
• a topic modeling map — an illustration of automatically classified documents

    From these visualizations I learned:

    • not much from the word cloud except what I already knew
    • the word “data” is connected to many different actions
• I have few, if any, citations in my collection from the mid-2000s

    I have often thought collections of metadata (citations, MARC records, the output from JSTOR’s Data For Research service) could easily be evaluated and visualized. Paper Machines does just that. I wish I had done it. (To some extent, I have, but the work is fledgling and called JSTOR Tool.)

    In any event, if you use Zotero, then I suggest you give Paper Machines a try.

    by Eric Lease Morgan at January 21, 2014 09:36 PM

    January 19, 2014

    LiAM: Linked Archival Metadata

    Linked Archival Metadata: A Guidebook — a fledgling draft

    For simplicity’s sake, I am making both a PDF and ePub version of the fledgling book called Linked Archival Metadata: A Guidebook available in this blog posting. It includes some text in just about every section of the Guidebook’s prospectus. Feedback is most desired.

    by Eric Lease Morgan at January 19, 2014 02:51 AM

    RDF ontologies for archival descriptions

    If you were to select a set of RDF ontologies intended to be used in the linked data of archival descriptions, then what ontologies would you select?

For simplicity’s sake, RDF ontologies are akin to the fields in MARC records or the entities in EAD/XML files. Articulated more accurately, they are the things denoting relationships between subjects and objects in RDF triples. In this light, they are akin to the verbs in all but the most simplistic of sentences. But if they are akin to verbs, then they bring with them all of the nuance and subtlety of human written language. And human written language, in order to be an effective human communications device, comes with two equally important prerequisites: 1) a writer who can speak to an intended audience, and 2) a reader with a certain level of intelligence. A writer who does not use the language of the intended audience speaks to few, and a reader who does not “bring something to the party” goes away with little understanding. Because the effectiveness of every writer is not perfect, and because not every reader comes to the party with a certain level of understanding, written language is imperfect. Similarly, the ontologies of linked data are imperfect. There are no perfect ontologies nor absolutely correct uses of them. There are only best practices and common usages.

    This being the case, ontologies still need to be selected in order for linked data to be manifested. What ontologies would you suggest be used when creating linked data for archival descriptions? Here are a few possibilities, listed in no priority order:

    • Dublin Core Terms – This ontology is rather bibliographic in nature, and provides a decent framework for describing much of the content of archival descriptions.
    • FOAF – Archival collections often originate from individual people. Such is the scope of FOAF, and FOAF is used by a number of other sets of linked data.
    • MODS – Because many archival descriptions are rooted in MARC records, and MODS is easily mapped from MARC.
    • Schema.org – This is an up-and-coming ontology heralded by the 600-pound gorillas in the room — Google, Microsoft, Yahoo, etc. While the ontology has not been put into practice for very long, it is growing and wide ranging.
    • RDF – This ontology is necessary because linked data is manifested as… RDF
    • RDFS – This ontology may be necessary because the archival community may be creating some of its own ontologies.
• OWL and SKOS – Both of these ontologies seem to be used to denote relationships between terms in other ontologies. In this way they are used to create classification schemes and thesauri. For example, they allow the implementor to state that “creator” in one ontology is the same as “author” in another ontology. Or they allow “country” in one ontology to be denoted as a parent geographic term for “city” in another ontology.

While some or all of these ontologies may be useful for linked data of archival descriptions, what might some other ontologies include? (Remember, it is often “better” to select existing ontologies rather than inventing your own, unless there is something distinctly unique about a particular domain.) For example, how about an ontology denoting times? Or how about one for places? FOAF is good for people, but what about organizations or institutions?
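To make the question a little more concrete, below is a tiny sketch, in the same N3 notation used earlier in these postings, of how a few of the ontologies listed above might be combined to describe a single collection. Every URI and value in it is make-believe.

@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.

<http://archive.example.org/id/collection-001>
    dcterms:title "Papers of Jane Doe";
    dcterms:temporal "1901/1952";
    dcterms:creator <http://archive.example.org/id/jane-doe>.

<http://archive.example.org/id/jane-doe> a foaf:Person;
    foaf:name "Jane Doe".

Even this toy example raises the questions asked above: which ontology should carry the dates, the places, or the relationship between the collection and the organization that holds it?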

    Inquiring minds would like to know.

    by Eric Lease Morgan at January 19, 2014 02:34 AM

    January 17, 2014

    Life of a Librarian

    Simple text analysis with Voyant Tools

    Voyant Tools is a Web-based application for doing a number of straight-forward text analysis functions, including but not limited to: word counts, tag cloud creation, concordancing, and word trending. Using Voyant Tools a person is able to read a document “from a distance”. It enables the reader to extract characteristics of a corpus quickly and accurately. Voyant Tools can be used to discover underlying themes in texts or verify propositions against them. This one-hour, hands-on workshop familiarizes the student with Voyant Tools and provides a means for understanding the concepts of text mining. (This document is also available as a PDF document suitable for printing.)

    Getting started

Voyant Tools is located at http://voyant-tools.org, and the easiest way to get started is by pasting into its input box a URL or a blob of text. For learning purposes, enter one of the URLs found at the end of this document, select from Thoreau’s Walden, Melville’s Moby Dick, or Twain’s Eve’s Diary, or enter a URL of your own choosing. Voyant Tools can read the more popular file formats, so URLs pointing to PDF, Word, RTF, HTML, and XML files will work well. Once given a URL, Voyant Tools will retrieve the associated text and do some analysis. Below is what is displayed when Walden is used as an example.

Voyant Tools

    In the upper left-hand corner is a word cloud. In the lower-left hand corner are some statistics. The balance of the screen is made up of the text. The word cloud probably does not provide you with very much useful information because stop words have not been removed from the analysis. By clicking on the word cloud customization link, you can choose from a number of stop word sets, and the result will make much more sense. Figure #2 illustrates the appearance of the word cloud once the English stop words are employed.

    By selecting words from the word cloud a word trends graph appears illustrating the relative frequency of the selection compared to its location in the text. You can use this tool to determine the consistency of the theme throughout the text. You can compare the frequency of additional words by entering them into the word trends search box. Figure #3 illustrates the frequency of the words pond and ice.

Figure 2 – word cloud
Figure 3 – word trends
Figure 4 – concordance

    Once you select a word from the word cloud, a concordance appears in the lower right hand corner of the screen. You can use this tool to: 1) see what words surround your selected word, and 2) see how the word is used in the context of the entire work. Figure #4 is an illustration of the concordance. The set of horizontal blue lines in the center of the screen denote where the selected word is located in the text. The darker the blue line is the more times the selected word appears in that area of the text.

    What good is this?

    On the surface of things you might ask yourself, “What good is this?” The answer lies in your ability to ask different types of questions against a text — questions you may or may not have been able to ask previously but are now able to ask because things like Voyant Tools count and tabulate words. Questions like:

• What are the most frequently used words in a text?
• What words do not appear at all or appear infrequently?
• Do any of these words represent any sort of theme?
• Where do these words appear in the text, and how do they compare to their synonyms or antonyms?
• Where should a person go looking in the text for the use of particular words or their representative themes?

    More features

Voyant Tools includes a number of other features. For example, multiple URLs can be entered into the home page’s input box. This enables the reader to examine many documents all at one time. (Try adding all the URLs at the end of the document.) After doing so many of the features of Voyant Tools work in a similar manner, but others become more interesting. For example, the summary pane in the lower left corner allows you to compare words across documents. (Consider applying the stop words feature to the pane in order to make things more meaningful.) Each of Voyant Tools’ panes can be exported to HTML files or linked from other documents. This is facilitated by clicking on the small icons in the upper right-hand corner of each pane. Use this feature to embed Voyant illustrations into Web pages or printed documents. By exploring the content of a site called Hermeneuti.ca (http://hermeneuti.ca) you can discover other features of Voyant Tools as well as other text mining applications.

The use of Voyant Tools represents an additional way of analyzing text(s). By counting and tabulating words, it provides a quick and easy quantitative method for learning what is in a text and what it might have to offer. The use of Voyant Tools does not offer “truth” per se, only new ways of observation.

    Sample links

    [1] Walden – http://infomotions.com/etexts/philosophy/1800-1899/thoreau-walden-186.txt
    [2] Civil Disobedience – http://infomotions.com/etexts/philosophy/1800-1899/thoreau-life-183.txt
    [3] Merrimack River – http://infomotions.com/etexts/gutenberg/dirs/etext03/7cncd10.txt

    by Eric Lease Morgan at January 17, 2014 10:39 PM

    January 09, 2014

    LiAM: Linked Archival Metadata

    LiAM Guidebook tools

This is an unfinished and barely refined list of linked data tools — a “webliography” — from the forthcoming LiAM Guidebook. It is presented here simply to give an indication of what appears in the text. These citations are also available as RDF, just for fun.

    1. “4store – Scalable RDF Storage.” Accessed November 12, 2013. http://4store.org/.
    2. “Apache Jena – Home.” Accessed November 11, 2013. http://jena.apache.org/.
    3. “Behas/oai2lod · GitHub.” Accessed November 3, 2013. https://github.com/behas/oai2lod.
    4. “BIBFRAME.ORG :: Bibliographic Framework Initiative – Overview.” Accessed November 3, 2013. http://bibframe.org/.
    5. “Ckan – The Open Source Data Portal Software.” Accessed November 3, 2013. http://ckan.org/.
    6. “Community | Tableau Public.” Accessed November 3, 2013. http://www.tableausoftware.com/public/community.
    7. “ConverterToRdf – W3C Wiki.” Accessed November 11, 2013. http://www.w3.org/wiki/ConverterToRdf.
    8. “Curl and Libcurl.” Accessed November 3, 2013. http://curl.haxx.se/.
    9. “D2R Server | The D2RQ Platform.” Accessed November 15, 2013. http://d2rq.org/d2r-server.
    10. “Disco Hyperdata Browser.” Accessed November 3, 2013. http://wifo5-03.informatik.uni-mannheim.de/bizer/ng4j/disco/.
    11. “Ead2rdf.” Accessed November 3, 2013. http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl.
    12. “Ewg118/eaditor · GitHub.” Accessed November 3, 2013. https://github.com/ewg118/eaditor.
    13. “Google Drive.” Accessed November 3, 2013. http://www.google.com/drive/apps.html.
14. “Society of American Archivists and the Berlin State Library.” The standard EAC-CPF is maintained by the Society of American Archivists in partnership with the Berlin State Library. Accessed January 1, 2014. http://eac.staatsbibliothek-berlin.de/.
    15. “Lmf – Linked Media Framework – Google Project Hosting.” Accessed November 3, 2013. https://code.google.com/p/lmf/.
    16. “OpenLink Data Explorer Extension.” Accessed November 3, 2013. http://ode.openlinksw.com/.
    17. “openRDF.org: Home.” Accessed November 12, 2013. http://www.openrdf.org/.
    18. “OpenRefine (OpenRefine) · GitHub.” Accessed November 3, 2013. https://github.com/OpenRefine/.
    19. “Parrot, a RIF and OWL Documentation Service.” Accessed November 11, 2013. http://ontorule-project.eu/parrot/parrot.
    20. “RDF2RDF – Converts RDF from Any Format to Any.” Accessed December 5, 2013. http://www.l3s.de/~minack/rdf2rdf/.
    21. “RDFImportersAndAdapters – W3C Wiki.” Accessed November 3, 2013. http://www.w3.org/wiki/RDFImportersAndAdapters.
    22. “RDFizers – SIMILE.” Accessed November 11, 2013. http://simile.mit.edu/wiki/RDFizers.
    23. “Semantic Web Client Library.” Accessed November 3, 2013. http://wifo5-03.informatik.uni-mannheim.de/bizer/ng4j/semwebclient/.
    24. “SIMILE Widgets | Exhibit.” Accessed November 11, 2013. http://www.simile-widgets.org/exhibit/.
    25. “SparqlImplementations – W3C Wiki.” Accessed November 3, 2013. http://www.w3.org/wiki/SparqlImplementations.
    26. “swh/Perl-SPARQL-client-library · GitHub.” Accessed November 3, 2013. https://github.com/swh/Perl-SPARQL-client-library.
    27. “Tabulator: Generic Data Browser.” Accessed November 3, 2013. http://www.w3.org/2005/ajar/tab.
    28. “TaskForces/CommunityProjects/LinkingOpenData/SemWebClients – W3C Wiki.” Accessed November 5, 2013. http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/SemWebClients.
    29. “TemaTres Controlled Vocabulary Server.” Accessed November 3, 2013. http://www.vocabularyserver.com/.
    30. “The D2RQ Platform – Accessing Relational Databases as Virtual RDF Graphs.” Accessed November 3, 2013. http://d2rq.org/.
    31. “The Protégé Ontology Editor and Knowledge Acquisition System.” Accessed November 3, 2013. http://protege.stanford.edu/.
    32. “Tools – Semantic Web Standards.” Accessed November 3, 2013. http://www.w3.org/2001/sw/wiki/Tools.
    33. “Tools | Linked Data – Connect Distributed Data Across the Web.” Accessed November 3, 2013. http://linkeddata.org/tools.
    34. “Vapour, a Linked Data Validator.” Accessed November 11, 2013. http://validator.linkeddata.org/vapour.
    35. “VirtuosoUniversalServer – W3C Wiki.” Accessed November 3, 2013. http://www.w3.org/wiki/VirtuosoUniversalServer.
    36. “W3C RDF Validation Service.” Accessed November 3, 2013. http://www.w3.org/RDF/Validator/.
    37. “W3c/rdfvalidator-ng.” Accessed December 10, 2013. https://github.com/w3c/rdfvalidator-ng.
    38. “Working with RDF with Perl.” Accessed November 3, 2013. http://www.perlrdf.org/.

    by Eric Lease Morgan at January 09, 2014 02:31 AM

    LiAM Guidebook linked data sites

This is an unfinished and barely refined list of linked data sites — a “webliography” — from the forthcoming LiAM Guidebook. It is presented here simply to give an indication of what appears in the text. These citations are also available as RDF, just for fun.

    1. “(LOV) Linked Open Vocabularies.” Accessed November 3, 2013. http://lov.okfn.org/dataset/lov/.
    2. “Data Sets & Services.” Accessed November 3, 2013. http://www.oclc.org/data/data-sets-services.en.html.
    3. “Data.gov.uk.” Accessed November 3, 2013. http://data.gov.uk/.
    4. “Freebase.” Accessed November 3, 2013. http://www.freebase.com/.
    5. “GeoKnow/LinkedGeoData · GitHub.” Accessed November 3, 2013. https://github.com/GeoKnow/LinkedGeoData.
    6. “GeoNames.” Accessed November 3, 2013. http://www.geonames.org/.
    7. “Getty Union List of Artist Names (Research at the Getty).” Accessed November 3, 2013. http://www.getty.edu/research/tools/vocabularies/ulan/.
    8. “Home – LC Linked Data Service (Library of Congress).” Accessed November 3, 2013. http://id.loc.gov/.
    9. “Home | Data.gov.” Accessed November 3, 2013. http://www.data.gov/.
    10. “ISBNdb – A Unique Book & ISBN Database.” Accessed November 3, 2013. http://isbndb.com/.
    11. “Linked Movie Data Base | Start Page.” Accessed November 3, 2013. http://linkedmdb.org/.
    12. “MusicBrainz – The Open Music Encyclopedia.” Accessed November 3, 2013. http://musicbrainz.org/.
    13. “New York Times – Linked Open Data.” Accessed November 3, 2013. http://data.nytimes.com/.
    14. “PELAGIOS: About PELAGIOS.” Accessed September 4, 2013. http://pelagios-project.blogspot.com/p/about-pelagios.html.
    15. “Start Page | D2R Server for the CIA Factbook.” Accessed November 3, 2013. http://wifo5-03.informatik.uni-mannheim.de/factbook/.
    16. “Start Page | D2R Server for the Gutenberg Project.” Accessed November 3, 2013. http://wifo5-03.informatik.uni-mannheim.de/gutendata/.
    17. “VIAF.” Accessed August 27, 2013. http://viaf.org/.
    18. “Web Data Commons.” Accessed November 19, 2013. http://webdatacommons.org/.
    19. “Welcome – the Datahub.” Accessed August 14, 2013. http://datahub.io/.
    20. “Welcome to Open Library (Open Library).” Accessed November 3, 2013. https://openlibrary.org/.
    21. “Wiki.dbpedia.org : About.” Accessed November 3, 2013. http://dbpedia.org/About.
    22. “World Bank Linked Data.” Accessed November 3, 2013. http://worldbank.270a.info/.html.

    by Eric Lease Morgan at January 09, 2014 02:25 AM

    LiAM Guidebook citations

    This is an unfinished and barely refined list of citations — a “webliography” — from the forthcoming LiAM Guidebook. It is presented here simply to give an indication of what appears in the text. These citations are also available as RDF, just for fun.

    1. admin. “Barriers to Using EAD,” August 4, 2012. http://oclc.org/research/activities/eadtools.html.
    2. Becker, Christian, and Christian Bizer. “Exploring the Geospatial Semantic Web with DBpedia Mobile.” Web Semantics: Science, Services and Agents on the World Wide Web 7, no. 4 (December 2009): 278–286. doi:10.1016/j.websem.2009.09.004.
    3. Belleau, François, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. “Bio2RDF: Towards a Mashup to Build Bioinformatics Knowledge Systems.” Journal of Biomedical Informatics 41, no. 5 (October 2008): 706–716. doi:10.1016/j.jbi.2008.03.004.
    4. Berners-Lee, Tim. “Linked Data – Design Issues.” Accessed August 4, 2013. http://www.w3.org/DesignIssues/LinkedData.html.
    5. Berners-Lee, Tim, James Hendler, and Ora Lassila. “The Semantic Web.” Scientific American 284, no. 5 (May 2001): 34–43. doi:10.1038/scientificamerican0501-34.
    6. Bizer, Christian, Tom Heath, and Tim Berners-Lee. “Linked Data – The Story So Far:” International Journal on Semantic Web and Information Systems 5, no. 3 (33 2009): 1–22. doi:10.4018/jswis.2009081901.
    7. Carroll, Jeremy J., Christian Bizer, Pat Hayes, and Patrick Stickler. “Named Graphs.” Web Semantics: Science, Services and Agents on the World Wide Web 3, no. 4 (December 2005): 247–267. doi:10.1016/j.websem.2005.09.001.
    8. “Chem2bio2rdf – How to Publish Data Using D2R?” Accessed January 6, 2014. http://chem2bio2rdf.wikispaces.com/How+to+publish+data+using+D2R%3F.
    9. “Content Negotiation.” Wikipedia, the Free Encyclopedia, July 2, 2013. https://en.wikipedia.org/wiki/Content_negotiation.
    10. “Cool URIs for the Semantic Web.” Accessed November 3, 2013. http://www.w3.org/TR/cooluris/.
    11. Correndo, Gianluca, Manuel Salvadores, Ian Millard, Hugh Glaser, and Nigel Shadbolt. “SPARQL Query Rewriting for Implementing Data Integration over Linked Data.” 1. ACM Press, 2010. doi:10.1145/1754239.1754244.
    12. David Beckett. “Turtle.” Accessed August 6, 2013. http://www.w3.org/TR/2012/WD-turtle-20120710/.
    13. “Debugging Semantic Web Sites with cURL | Cygri’s Notes on Web Data.” Accessed November 3, 2013. http://richard.cyganiak.de/blog/2007/02/debugging-semantic-web-sites-with-curl/.
    14. Dunsire, Gordon, Corey Harper, Diane Hillmann, and Jon Phipps. “Linked Data Vocabulary Management: Infrastructure Support, Data Integration, and Interoperability.” Information Standards Quarterly 24, no. 2/3 (2012): 4. doi:10.3789/isqv24n2-3.2012.02.
    15. Elliott, Thomas, Sebastian Heath, and John Muccigrosso. “Report on the Linked Ancient World Data Institute.” Information Standards Quarterly 24, no. 2/3 (2012): 43. doi:10.3789/isqv24n2-3.2012.08.
    16. Fons, Ted, Jeff Penka, and Richard Wallis. “OCLC’s Linked Data Initiative: Using Schema.org to Make Library Data Relevant on the Web.” Information Standards Quarterly 24, no. 2/3 (2012): 29. doi:10.3789/isqv24n2-3.2012.05.
    17. Hartig, Olaf. “Querying Trust in RDF Data with tSPARQL.” In The Semantic Web: Research and Applications, edited by Lora Aroyo, Paolo Traverso, Fabio Ciravegna, Philipp Cimiano, Tom Heath, Eero Hyvönen, Riichiro Mizoguchi, Eyal Oren, Marta Sabou, and Elena Simperl, 5554:5–20. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009. http://www.springerlink.com/index/10.1007/978-3-642-02121-3_5.
    18. Hartig, Olaf, Christian Bizer, and Johann-Christoph Freytag. “Executing SPARQL Queries over the Web of Linked Data.” In The Semantic Web – ISWC 2009, edited by Abraham Bernstein, David R. Karger, Tom Heath, Lee Feigenbaum, Diana Maynard, Enrico Motta, and Krishnaprasad Thirunarayan, 5823:293–309. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009. http://www.springerlink.com/index/10.1007/978-3-642-04930-9_19.
    19. Heath, Tom, and Christian Bizer. “Linked Data: Evolving the Web into a Global Data Space.” Synthesis Lectures on the Semantic Web: Theory and Technology 1, no. 1 (February 9, 2011): 1–136. doi:10.2200/S00334ED1V01Y201102WBE001.
    20. Isaac, Antoine, Robina Clayphan, and Bernhard Haslhofer. “Europeana: Moving to Linked Open Data.” Information Standards Quarterly 24, no. 2/3 (2012): 34. doi:10.3789/isqv24n2-3.2012.06.
    21. Kobilarov, Georgi, Tom Scott, Yves Raimond, Silver Oliver, Chris Sizemore, Michael Smethurst, Christian Bizer, and Robert Lee. “Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections.” In The Semantic Web: Research and Applications, edited by Lora Aroyo, Paolo Traverso, Fabio Ciravegna, Philipp Cimiano, Tom Heath, Eero Hyvönen, Riichiro Mizoguchi, Eyal Oren, Marta Sabou, and Elena Simperl, 5554:723–737. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009. http://www.springerlink.com/index/10.1007/978-3-642-02121-3_53.
    22. LiAM. “LiAM: Linked Archival Metadata.” Accessed July 30, 2013. http://sites.tufts.edu/liam/.
    23. “Linked Data.” Wikipedia, the Free Encyclopedia, July 13, 2013. http://en.wikipedia.org/w/index.php?title=Linked_data&oldid=562554554.
    24. “Linked Data Glossary.” Accessed January 1, 2014. http://www.w3.org/TR/ld-glossary/.
    25. “Linked Open Data.” Europeana. Accessed September 12, 2013. http://pro.europeana.eu/web/guest;jsessionid=09A5D79E7474609AE246DF5C5A18DDD4.
    26. “Linked Open Data in Libraries, Archives, & Museums (Google Group).” Accessed August 6, 2013. https://groups.google.com/forum/#!forum/lod-lam.
    27. “Linking Lives | Using Linked Data to Create Biographical Resources.” Accessed August 16, 2013. http://archiveshub.ac.uk/linkinglives/.
    28. “LOCAH Linked Archives Hub Test Dataset.” Accessed August 6, 2013. http://data.archiveshub.ac.uk/.
    29. “LODLAM – Linked Open Data in Libraries, Archives & Museums.” Accessed August 6, 2013. http://lodlam.net/.
    30. “Notation3.” Wikipedia, the Free Encyclopedia, July 13, 2013. http://en.wikipedia.org/w/index.php?title=Notation3&oldid=541302540.
    31. “OWL 2 Web Ontology Language Primer.” Accessed August 14, 2013. http://www.w3.org/TR/2009/REC-owl2-primer-20091027/.
    32. Quilitz, Bastian, and Ulf Leser. “Querying Distributed RDF Data Sources with SPARQL.” In The Semantic Web: Research and Applications, edited by Sean Bechhofer, Manfred Hauswirth, Jörg Hoffmann, and Manolis Koubarakis, 5021:524–538. Berlin, Heidelberg: Springer Berlin Heidelberg. Accessed September 4, 2013. http://www.springerlink.com/index/10.1007/978-3-540-68234-9_39.
    33. “RDF/XML.” Wikipedia, the Free Encyclopedia, July 13, 2013. http://en.wikipedia.org/wiki/RDF/XML.
    34. “RDFa.” Wikipedia, the Free Encyclopedia, July 22, 2013. http://en.wikipedia.org/wiki/RDFa.
    35. “Semantic Web.” Wikipedia, the Free Encyclopedia, August 2, 2013. http://en.wikipedia.org/w/index.php?title=Semantic_Web&oldid=566813312.
    36. “SPARQL.” Wikipedia, the Free Encyclopedia, August 1, 2013. http://en.wikipedia.org/w/index.php?title=SPARQL&oldid=566718788.
    37. “SPARQL 1.1 Overview.” Accessed August 6, 2013. http://www.w3.org/TR/sparql11-overview/.
    38. “Spring/Summer 2012 (v.24 No.2/3) – National Information Standards Organization.” Accessed August 6, 2013. http://www.niso.org/publications/isq/2012/v24no2-3/.
    39. Summers, Ed, and Dorothea Salo. Linking Things on the Web: A Pragmatic Examination of Linked Data for Libraries, Archives and Museums. ArXiv e-print, February 19, 2013. http://arxiv.org/abs/1302.4591.
    40. “The Linking Open Data Cloud Diagram.” Accessed November 3, 2013. http://lod-cloud.net/.
    41. “The Trouble with Triples | Duke Collaboratory for Classics Computing (DC3).” Accessed November 6, 2013. http://blogs.library.duke.edu/dcthree/2013/07/27/the-trouble-with-triples/.
    42. Tim Berners-Lee, James Hendler, and Ora Lassila. “The Semantic Web.” Accessed September 4, 2013. http://www.scientificamerican.com/article.cfm?id=the-semantic-web.
    43. “Transforming EAD XML into RDF/XML Using XSLT.” Accessed August 16, 2013. http://archiveshub.ac.uk/locah/tag/transform/.
    44. “Triplestore – Wikipedia, the Free Encyclopedia.” Accessed November 11, 2013. http://en.wikipedia.org/wiki/Triplestore.
    45. “Turtle (syntax).” Wikipedia, the Free Encyclopedia, July 13, 2013. http://en.wikipedia.org/w/index.php?title=Turtle_(syntax)&oldid=542183836.
    46. Volz, Julius, Christian Bizer, Martin Gaedke, and Georgi Kobilarov. “Discovering and Maintaining Links on the Web of Data.” In The Semantic Web – ISWC 2009, edited by Abraham Bernstein, David R. Karger, Tom Heath, Lee Feigenbaum, Diana Maynard, Enrico Motta, and Krishnaprasad Thirunarayan, 5823:650–665. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009. http://www.springerlink.com/index/10.1007/978-3-642-04930-9_41.
    47. Voss, Jon. “LODLAM State of Affairs.” Information Standards Quarterly 24, no. 2/3 (2012): 41. doi:10.3789/isqv24n2-3.2012.07.
    48. W3C. “LinkedData.” Accessed August 4, 2013. http://www.w3.org/wiki/LinkedData.
    49. “Welcome to Euclid.” Accessed September 4, 2013. http://www.euclid-project.eu/.
    50. “Wiki.dbpedia.org : About.” Accessed November 3, 2013. http://dbpedia.org/About.

    by Eric Lease Morgan at January 09, 2014 02:19 AM

    January 06, 2014

    LiAM: Linked Archival Metadata

    Publishing archival descriptions as linked data via databases

    Publishing linked data through XML transformation is functional but not optimal. Publishing linked data from a database comes closer to the ideal but requires a greater amount of technical computer infrastructure and expertise.

    Databases — specifically, relational databases — are the current best practice for organizing data. As you may or may not know, relational databases are made up of many tables of data joined with keys. For example, a book may be assigned a unique identifier. The book has many characteristics such as a title, number of pages, size, descriptive note, etc. Some of the characteristics are shared by other books, like authors and subjects. In a relational database these shared characteristics would be saved in additional tables, and they would be joined to a specific book through the use of unique identifiers (keys). Given this sort of data structure, reports can be created from the database describing its content. Similarly, queries can be applied against the database to uncover relationships that may not be apparent at first glance or that are buried in reports. The power of relational databases lies in the use of keys to make relationships between rows in one table and rows in other tables.
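
    To make the relational idea concrete, here is a minimal sketch, written in Python with the standard sqlite3 module, of the hypothetical book/author schema described above. The table names, column names, and page count are illustrative assumptions, not anything taken from an actual collection-management system.

        import sqlite3

        # A tiny in-memory database: books and authors live in separate
        # tables and are related to each other through keys.
        db = sqlite3.connect(":memory:")
        db.executescript("""
            CREATE TABLE books        (id INTEGER PRIMARY KEY, title TEXT, pages INTEGER);
            CREATE TABLE authors      (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE book_authors (book_id INTEGER, author_id INTEGER);
            INSERT INTO books        VALUES (1, 'Adventures Of Huckleberry Finn', 366);
            INSERT INTO authors      VALUES (1, 'Mark Twain');
            INSERT INTO book_authors VALUES (1, 1);
        """)

        # A simple "report" that follows the keys to uncover the
        # relationship between a book and its author.
        query = """
            SELECT b.title, a.name
            FROM books b
            JOIN book_authors ba ON ba.book_id = b.id
            JOIN authors a       ON a.id = ba.author_id
        """
        for title, name in db.execute(query):
            print(title, "by", name)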

    relational databases and linked data

    Not coincidentally, this is very much the way linked data is expected to be implemented. In the linked data world, the subjects of triples are URIs (think database keys). Each URI is associated with one or more predicates (think of the characteristics in the book example). Each triple then has an object, and these objects take the form of literals or other URIs. In the book example, the object could be “Adventures Of Huckleberry Finn” or a URI pointing to Mark Twain. The reports of relational databases are analogous to RDF serializations, and SQL (the relational database query language) is analogous to SPARQL, the query language of RDF triple stores. Because of the close similarity between well-designed relational databases and linked data principles, publishing linked data directly from relational databases makes a whole lot of sense, but the process requires the combined time and skills of a number of different people: content specialists, database designers, and computer programmers. Consequently, the process of publishing linked data from relational databases may be the optimal approach, but it is also the more expensive one.
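
    Under the same assumptions, the book can be re-expressed as triples with a few lines of Python and the rdflib library. The example.org subject URI is a stand-in for a real identifier, and the VIAF URI for Mark Twain is included only to illustrate an object that is a link rather than a literal.

        from rdflib import Graph, Literal, URIRef
        from rdflib.namespace import DC

        # The subject URI plays the role of the database key.
        book  = URIRef("http://example.org/book/1")
        twain = URIRef("http://viaf.org/viaf/50566653")  # a URI standing in for Mark Twain

        g = Graph()
        g.add((book, DC.title,   Literal("Adventures Of Huckleberry Finn")))
        g.add((book, DC.creator, twain))  # the object is another URI, not a string

        # One possible serialization; Turtle is roughly analogous to a
        # report generated from the database.
        print(g.serialize(format="turtle"))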

    Thankfully, most archivists probably use some sort of database to manage their collections and create their finding aids. Moreover, archivists probably use one of three or four tools for this purpose: Archivist’s Toolkit, Archon, ArchivesSpace, or PastPerfect. Each of these systems has a relational database at its heart. Reports could be written against the underlying databases to generate serialized RDF and thus begin the process of publishing linked data. Doing this from scratch would be difficult as well as inefficient, because many people would be starting out with the same database structure but creating a multitude of varying outputs. Consequently, there are two alternatives. The first is to use a generic database-to-RDF publishing platform called D2RQ. The second is for the community to join together and create a holistic RDF publishing system based on the database(s) used in archives.

    D2RQ is a wonderful software system. It is supported, well-documented, executable on just about any computing platform, open source, focused, and functional, and at the same time it does not try to be all things to all people. Using D2RQ it is entirely possible to quickly and easily publish a well-designed relational database as RDF. The process is relatively simple:

    1. download the software
    2. use a command-line utility to map the database structure to a configuration file
    3. season the configuration file to taste
    4. run the D2RQ server using the configuration file as input, thus allowing people or RDF user-agents to search and browse the database using linked data principles (a small query sketch follows this list)
    5. alternatively, dump the contents of the database to an RDF serialization and upload the result into your favorite RDF triple store
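
    Once the server is running (step #4), any SPARQL client can search the database as linked data. Below is a minimal sketch using Python and the SPARQLWrapper library; the endpoint URL assumes a default, locally running D2R Server and is therefore only an example.

        from SPARQLWrapper import SPARQLWrapper, JSON

        # The port and path are assumptions based on a default local
        # installation of D2R Server.
        endpoint = SPARQLWrapper("http://localhost:2020/sparql")
        endpoint.setQuery("""
            SELECT ?s ?p ?o
            WHERE { ?s ?p ?o }
            LIMIT 10
        """)
        endpoint.setReturnFormat(JSON)

        # Print a small sample of the triples exposed by the server.
        results = endpoint.query().convert()
        for row in results["results"]["bindings"]:
            print(row["s"]["value"], row["p"]["value"], row["o"]["value"])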

    For a limited period of time I have implemented D2RQ against my water collection (original HTML or linked data). Of particular interest is the list of classes (ontologies) and properties (terms) generated from the database by D2RQ. Here is a URI pointing to a particular item in the collection — Atlantic Ocean at Roch in Wales (original HTML or linked data).

    The downside of D2RQ is its generic nature. It will create an RDF ontology whose terms correspond to the names of database fields. These field names do not map to widely accepted ontologies and therefore will not interact well with communities outside the ones using a specific database structure. Still, the use of D2RQ is quick, easy, and accurate.

    The second alternative to using databases of archival content to publish linked data requires community effort and coordination. The databases of Archivist’s Toolkit, Archon, ArchivesSpace, or PastPerfect could be assumed. The community could then come together and decide on an RDF ontology to use for archival descriptions. The database structure(s) could then be mapped to this ontology. Next, programs could be written against the database(s) to create serialized RDF, thus beginning the process of publishing linked data. Once that was complete, the archival community would need to come together again to ensure it uses as many shared URIs as possible, thus creating the most functional sets of linked data. This second alternative requires a significant amount of community involvement and widespread education. It represents a never-ending process.
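
    As a sketch of what this second alternative might look like, the snippet below maps one hypothetical database record to a shared ontology (Dublin Core terms) and a shared URI. The record, the field names, and the URIs are all placeholders; they do not reflect the actual schema of any of the systems named above.

        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import DCTERMS

        # A community-agreed mapping from (hypothetical) database field
        # names to terms in a shared ontology.
        FIELD_TO_TERM = {
            "title":   DCTERMS.title,
            "creator": DCTERMS.creator,
            "date":    DCTERMS.date,
        }

        # A hypothetical row pulled from a collection-management database.
        record = {
            "id":      "mss-001",
            "title":   "Papers of Jane Doe",
            "creator": "http://viaf.org/viaf/0000000000",  # placeholder shared URI
            "date":    "1901-1950",
        }

        ARCHIVE = Namespace("http://archives.example.edu/collection/")
        g = Graph()
        subject = ARCHIVE[record["id"]]

        for field, term in FIELD_TO_TERM.items():
            value = record[field]
            # Use URIs where shared identifiers exist, literals otherwise.
            obj = URIRef(value) if value.startswith("http") else Literal(value)
            g.add((subject, term, obj))

        print(g.serialize(format="xml"))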

    by Eric Lease Morgan at January 06, 2014 08:14 PM

    January 05, 2014

    LiAM: Linked Archival Metadata

    Publishing linked data by way of EAD files

    [This blog posting comes from a draft of the Linked Archival Metadata: A Guidebook --ELM ]

    If you have used EAD to describe your collections, then you can easily make your descriptions available as valid linked data, but the result will be less than optimal. This is true not for a lack of technology but rather because of the inherent purpose and structure of EAD files.

    A few years ago an organisation in the United Kingdom called the Archives Hub was funded by a granting agency called JISC to explore the publishing of archival descriptions as linked data. One of the outcomes of this effort was the creation of an XSL stylesheet transforming EAD into RDF/XML. The terms used in the stylesheet originate from quite a number of standardized, widely accepted ontologies, and with only the tiniest bit of configuration and customization the stylesheet can transform a generic EAD file into valid RDF/XML. The resulting XML files can then be made available on a Web server or incorporated into a triple store. This goes a long way to publishing archival descriptions as linked data. The only additional things needed are a transformation of EAD into HTML and the configuration of a Web server to do content negotiation between the XML and HTML.
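
    A minimal sketch of the transformation step, using Python and the lxml library, might look like the following. The file names are placeholders; ead2rdf.xsl stands in for whatever copy of the Archives Hub style of stylesheet is being used.

        from lxml import etree

        # Load the stylesheet and the finding aid (file names are placeholders).
        transform = etree.XSLT(etree.parse("ead2rdf.xsl"))
        ead       = etree.parse("finding-aid.xml")

        # Apply the transformation; the result is an RDF/XML document.
        rdf = transform(ead)

        with open("finding-aid.rdf", "wb") as handle:
            handle.write(etree.tostring(rdf, xml_declaration=True, encoding="UTF-8", pretty_print=True))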

    For the smaller archive with only a few hundred EAD files whose content does not change very quickly, this is a simple, feasible, and practical solution to publishing archival descriptions as linked data. With the exception of doing some content-negotiation, this solution does not require any computer technology that is not already being used in archives, and it only requires a few small tweaks to a given workflow:

    1. implement a content-negotiation solution (see the sketch after this list)
    2. edit EAD file
    3. transform EAD into RDF/XML
    4. transform EAD into HTML
    5. save the resulting XML and HTML files on a Web server
    6. go to step #2
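
    For the content-negotiation step (#1), a very small Web application can inspect the HTTP Accept header and return either the RDF/XML or the HTML version of a finding aid. The following is a minimal sketch using Python and Flask; the URL pattern and the side-by-side .rdf/.html file layout are assumptions, and the same effect can be achieved with Web server configuration alone.

        from flask import Flask, request, send_file

        app = Flask(__name__)

        @app.route("/findingaids/<name>")
        def finding_aid(name):
            # Pick a representation based on what the client says it accepts.
            preferred = request.accept_mimetypes.best_match(
                ["application/rdf+xml", "text/html"], default="text/html"
            )
            if preferred == "application/rdf+xml":
                return send_file(f"findingaids/{name}.rdf", mimetype="application/rdf+xml")
            return send_file(f"findingaids/{name}.html", mimetype="text/html")

        if __name__ == "__main__":
            app.run()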

    On the other hand, an EAD file is the combination of a narrative description and a hierarchical inventory list, and this data structure does not lend itself very well to the triples of linked data. For example, EAD headers are full of controlled vocabulary terms, but there is no way to link these terms with specific inventory items. This is because the vocabulary terms are expected to describe the collection as a whole, not individual things. This problem could be overcome if each individual component of the EAD were associated with controlled vocabulary terms, but this would significantly increase the amount of work needed to create the EAD files in the first place.

    The common practice of using literals (“strings”) to denote the names of people, places, and things in EAD files would also need to be changed in order to fully realize the vision of linked data. Specifically, it would be necessary for archivists to supplement their EAD files with commonly used URIs denoting subject headings and named authorities. These URIs could be inserted into id attributes throughout an EAD file, and the resulting RDF would be more linkable, but the labor to do so would increase, especially since many of the named authorities will not exist in standardized authority lists.

    Despite these shortcomings, transforming EAD files into some sort of serialized RDF goes a long way towards publishing archival descriptions as linked data. This particular process is a good beginning and outputs valid information, just not information that is as accurate as it could be. The process lends itself to iterative improvements, and outputting something is better than outputting nothing. But this particular process is not for everybody. The archive whose content changes quickly, the archive with copious numbers of collections, or the archive wishing to publish the most accurate linked data possible will probably not want to use EAD files as the root of its publishing system. Instead, some sort of database application is probably the best solution.

    by Eric Lease Morgan at January 05, 2014 11:08 PM

    December 30, 2013

    Life of a Librarian

    Semantic Web in Libraries 2013

    I attended the Semantic Web in Libraries 2013 conference in Hamburg (November 25-27), and this posting documents some of my experiences. In short, I definitely believe the linked data community in libraries is maturing, but I still wonder whether or not the barrier to participation is really low enough for the vision of the Semantic Web to become a reality.

    Preconference on provenance

    On the first day I attended a preconference about linked data and provenance led by Kai Eckert (University of Mannheim) and Magnus Pfeffer (Stuttgart Media University). One of the fundamental ideas behind the Semantic Web and linked data is the collecting of triples denoting facts. These triples are expected to be amassed and then inferenced across in order to bring new knowledge to light. But in the scholarly world it is important to cite and attribute scholarly output. Triples are atomistic pieces of information: subjects, predicates, objects. But there is no room in these simple assertions to denote where the information originated. This issue was the topic of the preconference discussion. Various options were outlined, but none of them seemed optimal. I’m not sure of the conclusion, but one “solution” may be the use of PROV, “a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments such as the Web”.

    Day #1

    Both Day #1 and Day #2 were peppered with applications that harvested linked data (and other content) to create new and different views of information. AgriVIVO, presented by John Fereira (Cornell University), was a good example:

    AgriVIVO is a search portal built to facilitate connections between all actors in the agricultural field, bridging across separately hosted directories and online communities… AgriVIVO is based on the VIVO open source semantic web application initially developed at Cornell University and now adopted by several cross-institutional research discovery projects.

    Richard Wallis (OCLC) advocated the creation of library knowledge maps similar to the increasingly visible “knowledge graphs” created by Google and displayed at the top of search results. These “graphs” are aggregations of images, summaries, maps, and other bits of information providing the reader with answers and summaries describing what may be the topic of the search. They are the same sort of thing one sees when searches are done in Facebook as well. And in the true spirit of linked data principles, Wallis advocated the use of other people’s Semantic Web ontologies, such as the ontology used by Schema.org. If you want to participate and help extend the bibliographic entities of Schema.org, then consider joining a W3C community called the Schema Bib Extend Community Group.

    BIBFRAME was described by Julia Hauser and Reinhold Heuvelmann (German National Library). Touted as a linked data replacement for MARC, its data model consists of works, instances, authorities, and annotations (everything else). According to Hauser, “The big unknown is how can RDA or FRBR be expressed using BIBFRAME.” Personally, I noticed how BIBFRAME contains no holdings information, but such an issue may be resolvable through the use of Schema.org.

    “Language affects hierarchies and culture comes before language” were the concluding remarks in a presentation by the National Library of Finland. Leaders in the linked data world, the presenters described how they were trying to create a Finnish ontology, and they demonstrated how language does not fit into neat and orderly hierarchies and relationships. Things always get lost in translation. For example, one culture may have a single word for a particular concept, but another culture may have multiple words because the concept has more nuances in its experience. Somewhere along the line the presenters alluded to onki-light, “a REST-style API for machine and Linked Data access to the underlying vocabulary data.” I believe the presenters were using this tool to support access to their newly formed ontology.

    Yet another ontology was described by Carsten Klee (Berlin State Library) and Jakob Voß (GBV Common Library Network). This was a holdings ontology, which seemed unnecessarily complex to me, but then I’m no real expert. See the holding-ontology repository on Github.

    Day #2

    I found the presentation — “Decentralization, distribution, disintegration: Towards linked data as a first class citizen in Library Land” — by Martin Malmsten (National Library of Sweden) to be the most inspiring. In the presentation he described why he thinks linked data is the way to describe the content of library catalogs. He also made insightful distinctions between file formats and the essential characteristics of data, information, knowledge (and maybe wisdom). Like many at the conference, he advocated interfaces to linked data, not MARC:

    Working with RDF has enabled me to see beyond simple formats and observe the bigger picture — “Linked data or die”. Linked data is the way to do it now. I advocate the abstraction of MARC to RDF because RDF is more essential and fundamental… Mixing data is a new problem with the advent of linked data. This represents a huge shift in our thinking of Library Land. It is transformative… Keep the formats (monsters and zombies) outside your house. Formats are for exchange. True and real RDF is not a format.

    Some of the work demonstrating the expressed ideas of the presentation is available on Github in a package called librisxl.

    Another common theme and application demonstrated at the conference was variations on the venerable library catalog. OpenCat, presented by Agnes Simon (Bibliothèque Nationale de France), was an additional example of this trend. Combining authority data (available as RDF) provided by the National Library of France with the works of a second library (Fresnes Public Library), the OpenCat prototype provides quite an interesting interface to library holdings.

    Peter Király (Europeana Foundation) described how he is collecting content over many protocols and amalgamating it into the data store of Europeana. I appreciated the efforts he has made to normalize and enrich the data — not an easy task. The presentation also made me think about provenance. While provenance is important, maybe trust of provenance can come from the aggregator. I thought, “If these aggregators believe — trust — the remote sources, then maybe I can too.” Finally, the presentation got me imagining how one URI can lead to others, and my goal would be to distill all of the interesting information I found along the way back down into a single URI, as in the following image I doodled during the presentation.

    uri

    Enhancing the access and functionality of manuscripts was the topic of the presentation by Kai Eckert (Universität Mannheim). Specifically, manuscripts are digitized and an interface is placed on top allowing scholars to annotate the content beneath. I think the application supporting this functionality is called Pundit. Along the way he takes heterogeneous (linked) data and homogenizes it with a tool called DM2E.

    OAI-PMH was frequently alluded to during the conference, and I have some ideas about that. In “Application of LOD to enrich the collection of digitized medieval manuscripts at the University of Valencia”, Jose Manuel Barrueco Cruz (University of Valencia) described how the age of his content inhibited his use of the currently available linked data. I got the feeling there was little linked data closely associated with the subject matter of his manuscripts. Still, an important thing to note is how he started his investigations with the use of the Datahub:

    a data management platform from the Open Knowledge Foundation, based on the CKAN data management system… [providing] free access to many of CKAN’s core features, letting you search for data, register published datasets, create and manage groups of datasets, and get updates from datasets and groups you’re interested in. You can use the web interface or, if you are a programmer needing to connect the Datahub with another app, the CKAN API.

    Simeon Warner (Cornell University) described how archives or dumps of RDF triple stores are synchronized across the Internet via HTTP GET, gzip, and a REST-ful interface on top of Google sitemaps. I was impressed because the end result did not necessarily invent something new but rather implemented an elegant solution to a real-world problem using existing technology. See the resync repository on Github.

    In “From strings to things: A linked data API for library hackers and Web developers”, Fabian Steeg and Pascal Christoph (HBZ) described an interface allowing librarians to determine the URIs of people, places, and things for library catalog records. “How can we benefit from linked data without being linked data experts? We want to put Web developers into focus using JSON for HTTP.” There are a few hacks illustrating some of their work on Github in the lobid repository.

    Finally, I hung around for a single lightning talk — Carsten Klee’s (Berlin State Library) presentation of easyM2R, a PHP script converting MARC to any number of RDF serializations.

    Observations, summary, and conclusions

    I am currently in the process of writing a short book on the topic of linked data and archives for an organization called LiAM — “a planning grant project whose deliverables will facilitate the application of linked data approaches to archival description.” One of my goals for attending this conference was to determine my level of understanding when it comes to linked data. At the risk of sounding arrogant, I think I’m on target, but at the same time, I learned a lot at this conference.

    For example, I learned that the process of publishing linked data is not “rocket surgery” and what I have done to date is more than functional, but I also learned that creating serialized RDF from MARC or EAD is probably not the best way to create RDF. I learned that publishing linked data is only one half of the problem to be solved. The other half is figuring out ways to collect, organize, and make useful the published content. Fortunately this second half of the problem was much of what the conference was about. Many people are using linked data to either create or enhance “next-generation library catalogs”. In this vein they are not really doing anything new and different; they are being evolutionary. Moreover, many of the developers are aggregating content using quite a variety of techniques, OAI-PMH being one of the more frequent.

    When it comes to OAI-PMH and linked data, I see very much the same vision. Expose metadata in an agreed-upon format and via an agreed-upon method. Allow others to systematically harvest the metadata. Provide information services against the result. OAI-PMH was described as a protocol with a low barrier to entry. The publishing of linked data is also seen as a low-barrier technology. The challenges of both lie first in the vocabularies used to describe the things of the metadata. OAI-PMH required Dublin Core but advocated additional “ontologies”. Few people implemented them. Linked data is not much different. The problem with the language of the things is just as prevalent, if not more so. Linked data is not just the purview of Library Land and a few computer scientists. Linked data has caught the attention of a much wider group of people, albeit the subject is still a bit esoteric. I know the technology supporting linked data functions. After all, it is the technology of the Web. I just wonder whether: 1) there will ever be a critical mass of linked data available to fulfill its promise, and 2) we — the information community — will be able to overcome the “Tower of Babel” we are creating with all the various ontologies we are sporting. A single ontology won’t work. Just look at Dublin Core. Many ontologies won’t work either. There is too much variation and too many idiosyncrasies in real-world human language. I don’t know what the answer is. I just don’t.

    Despite some of my misgivings, I think the following quote by Martin Malmsten pretty much sums up much of the conference — Linked data or die!

    by Eric Lease Morgan at December 30, 2013 04:48 AM
