April 12, 2015

Mini-musings

Marrying close and distant reading: A THATCamp project

The purpose of this page is to explore and demonstrate some of the possibilities of marrying close and distant reading. By combining these two processes, the hope is that a greater comprehension and understanding of a corpus can be gained than by using close or distant reading alone. (This text might also be republished at http://dh.crc.nd.edu/sandbox/thatcamp-2015/ as well as http://nd2015.thatcamp.org/2015/04/07/close-and-distant/.)

To give this exploration a go, two texts are being used to form a corpus: 1) Machiavelli’s The Prince, and 2) Emerson’s Representative Men. Both texts were printed and bound into a single book (codex). The book is intended to be read in the traditional manner, and the layout includes extra-wide margins allowing the reader to liberally write/draw in the margins. While the glue was drying on the book, the plain text versions of the texts were evaluated using a number of rudimentary text mining techniques, and the results are made available here. Both the traditional reading and the text mining are aimed at answering a few questions. How do both Machiavelli and Emerson define a “great” man? What characteristics do “great” men have? What sorts of things have “great” men accomplished?

Comparison

Feature          The Prince                                          Representative Men
Author           Niccolò di Bernardo dei Machiavelli (1469–1527)     Ralph Waldo Emerson (1803–1882)
Title            The Prince                                          Representative Men
Date             1532                                                1850
Fulltext         plain text | HTML | PDF | TEI/XML                   plain text | HTML | PDF | TEI/XML
Length           31,179 words                                        59,600 words
Fog score        23.1                                                14.6
Flesch score     33.5                                                52.9
Kincaid score    19.7                                                11.5
Frequencies      unigrams, bigrams, trigrams, quadgrams, quintgrams  unigrams, bigrams, trigrams, quadgrams, quintgrams
Parts-of-speech  nouns, pronouns, adjectives, verbs, adverbs         nouns, pronouns, adjectives, verbs, adverbs
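The Fog, Flesch, and Kincaid values in the table are standard readability formulas computed from sentence length and syllable counts. A minimal sketch of how such scores are calculated, using a naive vowel-group syllable counter (the counter and the sample sentences are my own illustrations, not the fathom.pl implementation used for the table):

```python
import re

def syllables(word):
    # naive syllable count: runs of vowels, with a minimum of one
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    """Return (Flesch reading ease, Flesch-Kincaid grade, Gunning Fog)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    wps = len(words) / len(sentences)                 # words per sentence
    spw = sum(syllables(w) for w in words) / len(words)   # syllables per word
    complex_ratio = sum(1 for w in words if syllables(w) >= 3) / len(words)
    flesch = 206.835 - 1.015 * wps - 84.6 * spw
    kincaid = 0.39 * wps + 11.8 * spw - 15.59
    fog = 0.4 * (wps + 100 * complex_ratio)
    return flesch, kincaid, fog

simple = "The cat sat. The dog ran. We saw it all."
dense = ("Notwithstanding considerable methodological sophistication, "
         "quantitative readability estimation remains approximate.")

print(readability(simple))
print(readability(dense))
```

Note the directions of the scales: a higher Flesch score means an easier text, while higher Fog and Kincaid scores mean a harder one, which is exactly the pattern the table shows for Machiavelli versus Emerson.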

Search

Search for “man or men” in The Prince. Search for “man or men” in Representative Men.
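The searches above are powered by a simple concordance. The classic keyword-in-context (KWIC) display behind such a tool can be sketched in a few lines of Python (the sample sentence is my own stand-in, not text from either book):

```python
import re

def kwic(text, query, width=20):
    """Return keyword-in-context lines for every match of a regex query."""
    results = []
    for m in re.finditer(r"\b(%s)\b" % query, text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        results.append("%s[%s]%s" % (left, m.group(1), right))
    return results

sample = "A great man is he who in the midst of the crowd keeps the independence of solitude."
for line in kwic(sample, "man|men"):
    print(line)
```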

Observations

I consider this project to be a qualified success.

First, I was able to print and bind my book, and while the glue is still drying, I’m confident the final results will be more than usable. The real tests of the bound book are to see whether: 1) I actually read it, 2) I annotate it using my personal method, and 3) I am able to identify answers to my research questions, above.


bookmaking tools

almost done

Second, the text mining services turned out to be more of a compare & contrast methodology than a question-answering process. For example, I can see that one book was written hundreds of years before the other. The second book is almost twice as long as the first. Readability score-wise, Machiavelli is almost certainly written for the more educated, and Emerson is easier to read. The frequencies and parts-of-speech are enumerative, but not necessarily illustrative. There are a number of ways the frequencies and parts-of-speech could be improved. For example, just about everything could be visualized as histograms or word clouds. The verbs ought to be lemmatized. The frequencies ought to be depicted as ratios relative to the lengths of the texts. Other measures could be created as well. For example, my Great Books Coefficient could be employed.
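Depicting frequencies as ratios relative to the lengths of the texts, as suggested above, can be sketched with nothing more than the standard library (the two sample “texts” are toy stand-ins for the actual books):

```python
import re
from collections import Counter

def relative_frequencies(text):
    """Map each word to its frequency as a ratio of the text's total length."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

prince = "the prince must be a great man and a great prince"
emerson = "the great man represents great men to other men"

# comparing ratios rather than raw counts controls for the books' unequal lengths
print(relative_frequencies(prince)["great"])
print(relative_frequencies(emerson)["great"])
```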

How do Emerson and Machiavelli define a “great” man? Hmmm… Well, I’m not sure. It is relatively easy to get “definitions” of men in both books (The Prince or Representative Men). And network diagrams illustrating what words are used “in the same breath” as the word “man” in both works are not very dissimilar:


“man” in The Prince

“man” in Representative Men

I think I’m going to have to read the books to find the answer. Really.
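The network diagrams above plot words appearing “in the same breath” as a target word; the core computation is co-occurrence within a sliding window. A minimal sketch of that computation (the sample sentence and window size are my own illustrations, not the network.cgi implementation):

```python
import re
from collections import Counter

def cooccurrences(text, target, window=4):
    """Count words appearing within `window` words of each occurrence of target."""
    words = re.findall(r"[a-z']+", text.lower())
    neighbors = Counter()
    for i, word in enumerate(words):
        if word == target:
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    neighbors[words[j]] += 1
    return neighbors

sample = "a great man does great deeds and a wise man speaks wise words"
print(cooccurrences(sample, "man").most_common(5))
```

Feeding each (target, neighbor, count) triple to a graph-drawing library is then all that is needed to render diagrams like the two shown above.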

Code

Bunches o’ code was written to produce the reports:

  • concordance.cgi – the simple search engine
  • fathom.pl – used to compute the readability scores
  • file2pos.py – create a parts-of-speech file for later use
  • network.cgi – used to display words used “in the same breath” a given word
  • ngrams.pl – compute ngrams
  • pos.py – count and tabulate parts-of-speech from a previously created file

You can download this entire project — code and all — from http://dh.crc.nd.edu/sandbox/thatcamp-2015/reports/thatcamp-2015.tar.gz or http://infomotions.com/blog/wp-content/uploads/2015/04/thatcamp-2015.tar.gz.

by Eric Lease Morgan at April 12, 2015 04:47 PM

March 11, 2015

Life of a Librarian

Text files

While a rose is a rose is a rose, a text file is not a text file is not a text file.

For better or for worse, we here in our text analysis workshop are dealing with three different computer operating systems: Windows, Macintosh, and Linux. Text mining requires the subject of its analysis to be in the form of plain text files. [1] But there is a subtle difference between the ways each of our operating systems expect to deal with “lines” in that text. Let me explain.

Imagine a classic typewriter. A cylinder (called a “platen”) fit into a “carriage” designed to move back & forth across a box while “keys” were slapped against a piece of inked ribbon, ultimately imprinting a character on a piece of paper rolled around the platen. As each key was pressed the carriage moved a tiny bit from right to left. When the carriage got to the left-most position, the operator was expected to manually move it back to the right-most position and continue typing. This movement was really two movements in one. First, the carriage was “returned” to the right-most position, and second, the platen was rolled one line up. (The paper was “fed” around the platen by one line.) If one or the other of these two movements was not performed, then the typing would either run off the right-hand side of the paper, or the letters would be imprinted on top of the previously typed characters. These two movements are called “carriage returns” and “line feeds”, respectively.

Enter computers. Digital representations of characters were saved to files. These files were then sent to printers, but there was no person there to manually return the carriage nor to roll the paper further into the printer. Instead, invisible characters were created. There are many invisible characters, and the two of most interest to us are the carriage return (ASCII character 13) and the line feed (sometimes called “new line”, ASCII character 10). [2] When the printer received these characters, the print head or platen moved accordingly.

Enter our operating systems. For better or for worse, each of our operating systems traditionally treats the definition of lines differently:

  • in a traditional Macintosh file lines are delimited by a single carriage return (ASCII 13)
  • on Unix/Linux lines are delimited by line feeds (ASCII 10)
  • Windows computers expect lines to be delimited by a combination of both (ASCII 13 and ASCII 10)

Go figure?

Macintosh is much more like Unix nowadays, so most Macintosh text files use the Unix convention.

Windows folks, remember how your text files looked funny when initially displayed? This is because the original text files contained only ASCII 10 and not ASCII 13. Notepad, your default text editor, did not “see” the line feed characters, and consequently everything looked funny. Years ago, if a Macintosh computer read a Unix/Linux text file, then all the letters would be displayed on top of each other, which was even messier.

If you create a text file on your Windows or (older) Macintosh computer, and you then use this file as input to other programs (e.g., wget -i ./urls.txt), then the operation may fail because those programs may not know how a line is denoted in the input.
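One pragmatic fix is to normalize the line endings yourself before handing a file to another program. A minimal sketch in Python (the sample strings are my own illustrations):

```python
def normalize_newlines(text, newline="\n"):
    """Convert CRLF (Windows) and bare CR (classic Mac) endings to `newline`."""
    # order matters: the two-character CRLF sequence must be replaced first,
    # otherwise each CRLF would be counted as two separate line endings
    return text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", newline)

windows_text = "one\r\ntwo\r\nthree"       # ASCII 13 + ASCII 10
classic_mac_text = "one\rtwo\rthree"       # ASCII 13 only
unix_text = "one\ntwo\nthree"              # ASCII 10 only

for text in (windows_text, classic_mac_text, unix_text):
    print(normalize_newlines(text).split("\n"))
```

Incidentally, when Python itself opens a file in text mode it performs this translation automatically (“universal newlines”), which is one reason scripting languages cope with foreign text files better than older desktop tools did.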

Confused yet? In any event, text files are not text files are not text files. And the solution to this problem is to use a full-featured text editor, the subject of another essay.

[1] plain text files explained – http://en.wikipedia.org/wiki/Plain_text
[2] intro’ to ASCII – http://www.theasciicode.com.ar

by Eric Lease Morgan at March 11, 2015 06:42 PM

January 09, 2015

Life of a Librarian

Hands-on text analysis workshop

I have all but finished writing a hands-on text analysis workshop. From the syllabus:

The purpose of this 5-week workshop is to increase the knowledge of text mining principles among participants. By the end of the workshop, students will be able to describe the range of basic text mining techniques (everything from the creation of a corpus, to the counting/tabulating of words, to classification & clustering, to visualizing the results of text analysis) and will have garnered hands-on experience with all of them. All the materials for this workshop are available online. There are no prerequisites except for two things: 1) a sincere willingness to learn, and 2) a willingness to work at a computer’s command line interface. Students are strongly encouraged to bring their own computers to class.

The workshop is divided into the following five, 90-minute sessions, one per week:

  1. Overview of text mining and working from the command line
  2. Building a corpus
  3. Word and phrase frequencies
  4. Extracting meaning with dictionaries, parts-of-speech analysis, and named entity recognition
  5. Classification and topic modeling

For better or for worse, the workshop’s computing environment will be the Linux command line. Besides the usual command-line suspects, participants will get their hands dirty with wget, Tika, a bit of Perl, a lot of Python, Wordnet, TreeTagger, Stanford’s Named Entity Recognizer, and Mallet.

For more detail, see the syllabus, sample code, and corpus.

by Eric Lease Morgan at January 09, 2015 04:42 PM

distance.cgi – My first Python-based CGI script

Yesterday I finished writing my first Python-based CGI script, distance.cgi. Given two words, it allows the reader first to disambiguate between the various definitions of the words, and second to use Wordnet’s network to display various relationships (distances) between the resulting “synsets”. (Source code is here.)

Reader input

Disambiguate

Display result

The script relies on Python’s Natural Language Toolkit (NLTK) which provides an enormous amount of functionality when it comes to natural language processing. I’m impressed. On the other hand, the script is not zippy, and I am not sure how performance can be improved. Any hints?
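NLTK’s Wordnet interface measures relatedness by, among other things, the length of the shortest path between two synsets in Wordnet’s network. The underlying idea can be illustrated without NLTK at all, using breadth-first search over a toy hand-made graph (the graph below is my own illustration, not Wordnet’s actual hierarchy):

```python
from collections import deque

# a tiny hand-made "is-a" style graph standing in for Wordnet's network
graph = {
    "entity":   ["organism", "artifact"],
    "organism": ["entity", "person", "animal"],
    "artifact": ["entity", "book"],
    "person":   ["organism"],
    "animal":   ["organism", "dog"],
    "dog":      ["animal"],
    "book":     ["artifact"],
}

def path_distance(graph, start, goal):
    """Length of the shortest path between two nodes (BFS), or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

print(path_distance(graph, "person", "dog"))   # person -> organism -> animal -> dog
print(path_distance(graph, "person", "book"))
```

Wordnet’s graph is vastly larger than this toy, which is one likely reason the script is not zippy; caching synset lookups between requests might be one avenue for improving performance.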

by Eric Lease Morgan at January 09, 2015 04:10 PM

January 01, 2015

Mini-musings

Great Books Survey

I am happy to say that the Great Books Survey is still going strong. Since October of 2010 it has been answered 24,749 times by 2,108 people from all over the globe. To date, the top five “greatest” books are the Athenian Constitution by Aristotle, Hamlet by Shakespeare, Don Quixote by Cervantes, the Odyssey by Homer, and the Divine Comedy by Dante. The least “greatest” books are Rhesus by Euripides, On Fistulae by Hippocrates, On Fractures by Hippocrates, On Ulcers by Hippocrates, and On Hemorrhoids by Hippocrates. Too bad, Hippocrates.

For more information about this Great Books of the Western World investigation, see the various blog postings.

by Eric Lease Morgan at January 01, 2015 03:55 PM
