Friday, March 23, 2012

Text mining with perl: Resources

One of the exciting area of the analytics movement is text mining, where you mine relevant information from unstructured text. This is, hence, different from data mining, where you mine data for patterns. It is generally accepted that text mining is considerably harder than data mining, mostly because the source is unstructured. but the benefits are often substantial - for example, there is an investment fund which uses analysis of twitter feeds for predicting stock movement. This question on a quant Q&A site discusses a few similar applications. This is just one example application. Another example is IBM's Watson, who famously won the jeopardy in February of 2011.

Text mining and natural language processing
Text mining has close association with natural language processing (NLP). Since you have to search for information in unstructured text which are mostly text in natural languages, you end up using quite a few of the algorithms used in NLP. Recently, I have been moving a bit deeper into text mining and NLP, for a side project of mine (maybe a later blog post, but let me tell you, it is not for investment purposes :-).  Text mining and NLP is a good fit for me (I am a developer of a Monte Carlo simulation software called Oracle Crystal Ball in my day job) - they use a lot of probabilities and modeling, similar to what I do for Monte Carlo simulation. There isn't much optimization and O.R. yet, but it is analytics all the way. 

To that end, I am learning the tricks of the game, and a suitable language to program.

Perl for data extraction
Although I mostly work in the .NET world, often times, for some processing jobs, I revert to perl. I have used perl for many years, and love the language. Initially, I learned perl to do some simple web-CGI programming when I was in IIT Kharagpur, in my undergraduate years. Then I used perl for heavy database scripting at a job I did immediately following my undergraduate degree. I am by no means an expert in perl, but I know decent amount to get by. Since then, I have used perl in a variety of capacity: in my current work, in hobby web-CGI programming and so on. Although, lot of people have moved on to php for web-CGI and python for non-CGI scripting, I never could do so. There is nothing wrong with perl anyway, why bother?

I also think perl is a great language for people who deal with data. As a software developer for an analytic tools, I deal with data all the time, and lots of them. The strength of perl comes from its rich feature set suitable for data extraction (indeed, perl is, unofficially, an acronym for Practical Extraction and Reporting Language). I have used perl (along with unix shell e.g., bash, sed and awk) scripts for extracting data and formatting the way I need with great success.

Perl for text mining
When you think about the strengths of perl, it naturally becomes a contender for the language of choice for NLP. Perl has strong but easy to use regular expressions, and various modules for extracting anything from anywhere. Given that, and my familiarity with perl, I decided to take a look at the resources available for this purpose. Not so surprisingly, there are quite a few.

Resources for text mining
Here is my compiled list of resources, with an obvious bias towards using perl.
  • Books: Let's start with books. There are no dearth of books on text mining and NLP, but I will list a few ones I thought are good.
    • Manning and Schutze. Foundations of Statistical Natural Language Processing. The MIT Press. 1st Ed. 1999. This book is a gem, sorta bible for modern NLP. It has got nothing to do with perl, but since text mining has lot of NLP stuff, this book is a great reference.
    • Manu Konchady. Text Mining Application Programming. Charles River Media. 1st Ed. 2006. This is probably the best book out there to learn text mining using perl. Note that the book is not that heavy on perl code though, but stresses on developing the concepts and the models that are used in text mining. The book covers text mining by talking about various problems that appear in (and can be solved using) text mining. I liked that approach. It also helps that the author maintains an open source library for text mining written in perl.
    • Roger Bilisoly. Practical Text Mining with Perl. Wiley. 1st Ed. 2008. This is probably the only book that directly deals with both text mining and perl. This is also probably a pretty good book, but unfortunately it did not meet my exact needs. I found the topic covered somewhat lacking when compared to Konchady's book.
    • Weiss et. al. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer. 1st Ed. 2005 (Softcover - 2010). The contents of this book seemed interesting, similar to Konchady's book but the treatment more rigorous. I have put this in my "to check out" list.
  • Software: As with books, there are no dearth of software for text mining either. Since I was looking for perl related solutions, those are the only ones I list here.
    • Text mine: The open source perl library I mentioned earlier.
    • Text::Mining module on CPAN: This module seems comprehensive, but is rather poorly documented. It is not very clear from the documentation available in the module about how to use various sub-modules effectively. Hope the authors work on that.
  • Video learning: There are quite a few lectures available on Youtube. But the best one has to be an online course from Stanford, which is going on right now. The course is from Prof. Manning, the author of the first book in my list. The online course instructs the assignments to be submitted in java or python, and I understand their logic - those two being at the top of the heap of popular languages, but I have decided to use perl for the purpose of doing the assignments. I won't be graded then, but hey, I did not want to be graded in the first place ;-). The reason for choosing perl: my familiarity and knowledge of the strengths of perl in this area (although, I am familiar with java too, at almost the same level), the lure of being able to use the same code in my side project, and possibly reuse some of that stuff in a web-CGI context at some point.
I will update this post if I come across more relevant resources.


  1. Unless Perl is the only language you know or the functionality you need is in *existing* CPAN code, you're much better off in Python or Java. Python has the Natural Language Toolkit, NumPy, SciPy and Matplotlib. Java has a boatload of existing NLP and machine learning code.

    Perl is a great language for *text* processing, but for natural language processing and machine learning it has fallen behind Python and Java. It's a timing thing; Perl got object-orientation and CPAN after Java was well established as the language for practical work in heavy-duty AI, and then Python came along and was easier to learn than either Perl or Java.

  2. I completely agree with you from the language perspective. My reasoning though, is, more for learning purposes and reusing code in web-CGI context. Most shared hosting services allow perl or php, but rarely java or python for CGI scripting. I know a decent amount of languages, but the selection was limited due to context, accompanied by my liking for perl. Had I been looking to choose an off-the-shelf software to get some NLP done, I think my choice might have been different.

  3. Very nice article, I used some of your useful concepts to write my own article: Self-Service Business Intelligence and Natural Language Reporting. I'd love to get your feedback. Don't you think that Natural Language Business Intelligence is the next step of Business Intelligence's evolution?

  4. Hi Tim, good luck with your blog. Yes - NLP can be pretty big for reporting, if executed properly.

  5. Interesting post, Tim. Just curious if you've heard of the cutting-edge enterprise data modeling solution from Modern Analytics? I think it's something that you may be interested in, especially if data mining is your field of expertise.