I have all but finished writing a hands-on text analysis workshop. From the syllabus:
The purpose of this 5-week workshop is to increase participants’ knowledge of text mining principles. By the end of the workshop, students will be able to describe the range of basic text mining techniques (everything from the creation of a corpus, to the counting/tabulating of words, to classification & clustering, to visualizing the results of text analysis) and will have gained hands-on experience with all of them. All the materials for this workshop are available online. There are no prerequisites except for two things: 1) a sincere willingness to learn, and 2) a willingness to work at a computer’s command-line interface. Students are strongly encouraged to bring their own computers to class.
The workshop is divided into the following five 90-minute sessions, one per week:
- Overview of text mining and working from the command line
- Building a corpus
- Word and phrase frequencies
- Extracting meaning with dictionaries, parts-of-speech analysis, and named entity recognition
- Classification and topic modeling
For better or for worse, the workshop’s computing environment will be the Linux command line. Besides the usual command-line suspects, participants will get their hands dirty with wget, Tika, a bit of Perl, a lot of Python, WordNet, TreeTagger, Stanford’s Named Entity Recognizer, and MALLET.
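To give a flavor of the "word and phrase frequencies" session, here is a minimal Python sketch of counting words and two-word phrases. It is not taken from the workshop materials; the function name `word_frequencies`, the crude regular-expression tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text, n=1):
    """Count n-word phrases in text (hypothetical helper, not workshop code)."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n tokens across the list to form phrases.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


sample = "the cat sat on the mat and the cat slept"
unigrams = word_frequencies(sample)
bigrams = word_frequencies(sample, n=2)

print(unigrams.most_common(3))  # the most frequent single words
print(bigrams.most_common(1))   # the most frequent two-word phrase
```

Saved to a file and run with `python3` from the command line, this prints the most frequent words and phrases in the sample; a real corpus would call for a more careful tokenizer and probably a stop word list.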