Friday, March 23, 2012

Text mining with perl: Resources

One of the exciting areas of the analytics movement is text mining, where you mine relevant information from unstructured text. This is different from data mining, where you mine data for patterns. It is generally accepted that text mining is considerably harder than data mining, mostly because the source is unstructured, but the benefits are often substantial. For example, there is an investment fund that analyzes Twitter feeds to predict stock movements, and this question on a quant Q&A site discusses a few similar applications. Another example is IBM's Watson, which famously won Jeopardy! in February 2011.

Text mining and natural language processing
Text mining is closely associated with natural language processing (NLP). Since you have to search for information in unstructured text, most of which is written in natural languages, you end up using quite a few of the algorithms used in NLP. Recently, I have been moving a bit deeper into text mining and NLP for a side project of mine (maybe a later blog post, but let me tell you, it is not for investment purposes :-). Text mining and NLP are a good fit for me (in my day job I am a developer of a Monte Carlo simulation software called Oracle Crystal Ball): they use a lot of probability and modeling, similar to what I do for Monte Carlo simulation. There isn't much optimization and O.R. yet, but it is analytics all the way.

To that end, I am learning the tricks of the trade, and looking for a suitable language to program in.

Perl for data extraction
Although I mostly work in the .NET world, I often revert to perl for processing jobs. I have used perl for many years, and love the language. I first learned perl to do some simple web-CGI programming during my undergraduate years at IIT Kharagpur. Then I used perl for heavy database scripting at the job I took immediately after my undergraduate degree. I am by no means an expert in perl, but I know enough to get by. Since then, I have used perl in a variety of capacities: in my current work, in hobby web-CGI programming and so on. Although a lot of people have moved on to php for web-CGI and python for non-CGI scripting, I never could. There is nothing wrong with perl anyway, so why bother?

I also think perl is a great language for people who deal with data. As a software developer of analytic tools, I deal with data all the time, and lots of it. The strength of perl comes from its rich feature set for data extraction (indeed, perl is, unofficially, an acronym for Practical Extraction and Report Language). I have used perl scripts (along with unix shell tools, e.g., bash, sed and awk) to extract data and format it the way I need, with great success.

Perl for text mining
When you think about the strengths of perl, it naturally becomes a contender for the language of choice for NLP. Perl has strong but easy to use regular expressions, and various modules for extracting anything from anywhere. Given that, and my familiarity with perl, I decided to take a look at the resources available for this purpose. Not so surprisingly, there are quite a few.

Resources for text mining
Here is my compiled list of resources, with an obvious bias towards using perl.
  • Books: Let's start with books. There is no dearth of books on text mining and NLP, but I will list a few that I thought were good.
    • Manning and Schütze. Foundations of Statistical Natural Language Processing. The MIT Press. 1st Ed. 1999. This book is a gem, sort of a bible for modern NLP. It has nothing to do with perl, but since text mining draws heavily on NLP, this book is a great reference.
    • Manu Konchady. Text Mining Application Programming. Charles River Media. 1st Ed. 2006. This is probably the best book out there for learning text mining using perl. Note that the book is not that heavy on perl code; it focuses instead on developing the concepts and the models that are used in text mining. The book covers text mining by talking about various problems that appear in (and can be solved using) text mining. I liked that approach. It also helps that the author maintains an open source library for text mining written in perl.
    • Roger Bilisoly. Practical Text Mining with Perl. Wiley. 1st Ed. 2008. This is probably the only book that directly deals with both text mining and perl. It is probably a pretty good book, but unfortunately it did not meet my exact needs. I found the topics covered somewhat lacking when compared to Konchady's book.
    • Weiss et al. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer. 1st Ed. 2005 (Softcover - 2010). The contents of this book seemed interesting: similar to Konchady's book, but with a more rigorous treatment. I have put this in my "to check out" list.
  • Software: As with books, there is no dearth of software for text mining either. Since I was looking for perl related solutions, those are the only ones I list here.
    • Text mine: The open source perl library I mentioned earlier.
    • Text::Mining module on CPAN: This module seems comprehensive, but is rather poorly documented. The documentation that ships with the module does not make it clear how to use the various sub-modules effectively. I hope the authors work on that.
  • Video learning: There are quite a few lectures available on YouTube, but the best resource has to be an online course from Stanford, which is running right now. The course is taught by Prof. Manning, a co-author of the first book in my list. The course asks for assignments to be submitted in java or python, and I understand the logic, those two being at the top of the heap of popular languages, but I have decided to do the assignments in perl. I won't be graded then, but hey, I did not want to be graded in the first place ;-). My reasons for choosing perl: my familiarity with it and knowledge of its strengths in this area (although I am familiar with java too, at almost the same level), the lure of being able to use the same code in my side project, and the possibility of reusing some of it in a web-CGI context at some point.
I will update this post if I come across more relevant resources.

Monday, March 5, 2012

Confidence Intervals (Prediction Intervals) in CB Predictor (Part 1): The Calculations

In the CB Predictor result window, we show the confidence interval (CI) [also known in the literature as Prediction Intervals (* See Below)] of the forecasts. Prof. Chatfield has a great paper about prediction intervals, their importance, and how to compute them - interested readers, please have a look.

The default CIs in the CB Predictor result window are at 2.5% and 97.5%, but this can be easily changed using the dropdown at the bottom-right corner of the result window (see figure 1). One can even use custom percentiles for the CIs shown in the result window (and also in the Predictor reports). Although this part is quite straightforward, there are quite a few steps in calculating the standard error that could use some further explanation. In this and subsequent posts, we will discuss some of the interesting stuff around the CI in Predictor.

Figure 1: Result window

Confidence intervals for classic methods
The way we calculate the confidence intervals for classic forecasting methods (four non-seasonal and four seasonal methods) is not very obvious from our charts and reports. We use an empirical formula to calculate the confidence intervals for the forecast of each period.

The formula is from the following reference:
Bowerman, B.L., and R.T. O'Connell (contributor). Forecasting and Time Series: An Applied Approach (The Duxbury Advanced Series in Statistics and Decision Sciences). Belmont, CA: Duxbury Press. 1993.
See Section 8.6, Prediction Intervals (page 427).

The Google Books project seems to have only snippets of the book available due to licensing issues, so if anyone is interested in the actual text and does not have access to the book, please feel free to contact me (or Oracle support, for that matter) to receive a scan of the relevant pages.

The method makes a few assumptions:
  • The amount of historical data is sufficiently large
  • The forecast errors are normally distributed

Method summary
Without reproducing the formula for this calculation, which is already available in our reference manual and in the reference above, let us describe the process here. Note that the symbols used in the formula in our manual (or in the book) are not the same as the symbols used in the description below, but the idea is the same.

Let's say we have data points for periods 1 to t: Y(1) to Y(t). We calculate the forecast for period t+1 using whichever forecasting method is selected. Let's say the forecast is F(t+1). From the fitted data (using the forecasting model) we can also calculate the RMSE; let's call it RMSE(t+1). Under the normality assumption, the forecast for period t+1 is then distributed as N(F(t+1), RMSE(t+1)), that is, a normal distribution with mean F(t+1) and standard deviation RMSE(t+1). This normal distribution is then used to calculate the CIs at different percentile levels. Note that the fitted values we get from the model, which are used to calculate both F(t+1) and RMSE(t+1), are nothing but 1-period ahead forecasts at each period, starting from period, say, k (*), and ending at period t, using the equations of the forecasting model. It is as if we were forecasting 1-period ahead at every period in (k,...,t).

(*) Important note: When I say "starting at period k", k is the first period at which we can begin to calculate the 1-period ahead forecasts. For some models (like Double Moving Average, or the seasonal models), one can't really start calculating 1-period ahead forecasts at period 2.

This is the insight that is used in the heuristic. For period (t+2), the forecast F(t+2) is obtained from the method equations. For the standard error, we calculate 2-period ahead forecasts at each period, starting from period, say, k+1, and ending at period t, and then calculate the RMSE of these forecasts against the actual data, calling it RMSE(t+2). The normal distribution is then N(F(t+2), RMSE(t+2)), which is used in calculating the CIs. For period (t+3), we use the same procedure, except now we look at 3-period ahead forecasts, and so on.
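
As a rough illustration of the heuristic, here is a minimal Python sketch. It assumes single exponential smoothing as the forecasting method, with a fixed smoothing constant alpha picked by hand, and it divides the squared errors by their count. The function names are made up for this example, and the exact details in Predictor (starting period, divisor) may differ.

```python
import math


def ses_levels(y, alpha):
    """Smoothed levels for single exponential smoothing.

    levels[i] is the level after observing y[i]. Initialising with the
    first observation is an assumption made for this sketch.
    """
    levels = [y[0]]
    for obs in y[1:]:
        levels.append(alpha * obs + (1 - alpha) * levels[-1])
    return levels


def h_step_rmse(y, alpha, h):
    """RMSE of the h-period ahead forecasts made at every possible period."""
    levels = ses_levels(y, alpha)
    # For single exponential smoothing, the h-period ahead forecast made at
    # period i is simply the level at period i, whatever the value of h.
    errors = [y[i + h] - levels[i] for i in range(len(y) - h)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


if __name__ == "__main__":
    history = [10, 12, 11, 13, 12, 14]  # made-up data
    for h in (1, 2, 3):
        print(h, h_step_rmse(history, alpha=0.5, h=h))
```

The value returned for horizon h plays the role of RMSE(t+h) in the description above.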

The procedure is simple and intuitive, and works pretty well in most situations. Using the above description, one should be able to easily validate the numbers seen in Predictor by implementing a forecasting method and the standard error for that method in Excel.
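
If you would rather check the last step outside Excel, the following Python sketch turns a forecast and its RMSE into interval bounds at the default 2.5% and 97.5% levels. It only assumes the normal distribution described above, with the RMSE used as the standard deviation; the function name and the numbers are hypothetical.

```python
from statistics import NormalDist


def interval_bounds(forecast, rmse, lower_pct=2.5, upper_pct=97.5):
    """Lower and upper bounds from N(forecast, rmse) at the given percentiles."""
    dist = NormalDist(mu=forecast, sigma=rmse)
    return dist.inv_cdf(lower_pct / 100), dist.inv_cdf(upper_pct / 100)


if __name__ == "__main__":
    # Made-up forecast and RMSE, just to show the call.
    low, high = interval_bounds(forecast=100.0, rmse=5.0)
    print(low, high)
```

Passing other values for lower_pct and upper_pct corresponds to choosing custom percentiles in the result window.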

Confidence Intervals for ARIMA methods
For Box-Jenkins ARIMA methods, we do not use the heuristic mentioned above. Rather, we use the theoretical formula to calculate the standard errors.

A reference for this procedure is below. 
Box, G. E. P., Jenkins, G. M., and Reinsel, G. C. Time Series Analysis: Forecasting and Control. 4th ed. Hoboken, NJ: John Wiley & Sons. 2008. Chapter 5, Sections 5.1.1 and 5.2.4.
We follow the procedure exactly (yes, including the calculation of $\psi$ functions), so please refer to the text if you have questions regarding the calculations.
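
For readers who want a feel for the $\psi$ calculation, here is a minimal Python sketch of the standard recursion and of the resulting standard error, $\sigma_a \sqrt{1 + \psi_1^2 + \dots + \psi_{l-1}^2}$ for lead time $l$. It uses the Box-Jenkins sign convention for the coefficients; the function names are made up for this example, and it is not meant as a description of Predictor's actual code.

```python
import math


def psi_weights(phi, theta, n):
    """Return psi_0 .. psi_{n-1} for the model phi(B) z_t = theta(B) a_t.

    Sign convention (Box-Jenkins): phi(B) = 1 - phi_1 B - ... and
    theta(B) = 1 - theta_1 B - ... For a model with differencing, pass the
    coefficients of the generalized AR operator, i.e. phi(B)(1 - B)^d
    expanded, as `phi`.
    """
    psi = [1.0]
    for j in range(1, n):
        value = -theta[j - 1] if j <= len(theta) else 0.0
        for i in range(1, min(j, len(phi)) + 1):
            value += phi[i - 1] * psi[j - i]
        psi.append(value)
    return psi


def forecast_std_error(phi, theta, sigma, lead):
    """Standard error of the forecast `lead` periods ahead."""
    psi = psi_weights(phi, theta, lead)
    return sigma * math.sqrt(sum(w * w for w in psi))


if __name__ == "__main__":
    # Random walk, ARIMA(0,1,0): generalized AR operator is 1 - B, so phi = [1.0].
    for lead in (1, 2, 3, 4):
        print(lead, forecast_std_error([1.0], [], sigma=2.0, lead=lead))
```

For the random walk all the weights equal 1, so the standard error grows with the square root of the lead time, which is a handy sanity check.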

(*) Prof. Hyndman has a recent post where he says that the terms "prediction interval" and "confidence interval" mean different things and should not be used interchangeably. Personally, I still think that the difference is too technical!

Update (03/15/2013): Link on prediction interval vs. confidence interval.