Saturday, December 24, 2011

INFORMS 2011 presentation on correlated random variables

In the INFORMS Fall Annual Conference 2011 at Charlotte, NC, which concluded in November, I gave a short presentation on correlated random variables. One of the most important aspects of creating a Monte Carlo simulation model is to consider correlations between the random variables. In this presentation, I talk about why considering correlations is important, how to model correlations between random variables, and how to generate random numbers from correlated random variables (mostly using the Crystal Ball implementation as a standard).

Here is a link to the presentation. As always, please comment if you have additional questions.

Thursday, December 8, 2011

New blog on developing Operations Research and mathematical software

A quick note: I have started writing another blog on developing Operations Research and mathematical software. Our development platform is .NET, but the posts in this blog are not necessarily .NET related. They are just notes focused on developing OR software or math software. 

Some of the recent posts:
... and a few more. Some of the posts also contain entrepreneurial thoughts related to developing this type of software. If you enjoy developing math software, I think you will enjoy the posts. Please join me at http://orsoftware.blogspot.com/

Comparing doubles - an old old problem

The problem of comparing two doubles occurs in so many situations that I am sure, in some dev circles, it has probably earned the name “the double trouble”. I am referring to the problem where you have to compare two doubles for equality, and based on the result (i.e., based on whether they are equal or not), the code either kills a demon or self-destructs (put your own example here). In iterative algorithms, these situations are quite common. Searching through my code-base, I seem to have used these checks in the following situations:
  • Matrix and Vector operations
    • In a routine for reducing a symmetric matrix to tridiagonal matrix (believe me, there is a reason why I couldn’t use one of the numerous BLAS packages)
    • In the equality operator overload for comparing real matrices
  • Comparing coefficients of two polynomials for solving ARIMA equations in time-series forecasting
  • Calculating approximate correlation coefficients for groups of variables
  • Coding Quasi-Newton methods (especially for solving box-constrained problems)
So, I was quite pleased to see this topic covered in a recent post by Peter Vogel in the Visual Studio magazine. Comparing my code with Peter’s code got me thinking a bit, and hence this post.

Absolute and Relative Comparison
For starters, I have been using absolute comparison with a constant epsilon value (1E-15), and it has been working pretty well. The code is simple, and looks like this:

public static bool AreEqual(double a, double b, double EPS)
{
     // Values of strictly opposite sign are treated as unequal outright;
     // note that this also rejects tiny values straddling zero
     // (e.g., 1E-20 vs -1E-20), even though their difference is below EPS.
     if ((a < 0 && b > 0) || (a > 0 && b < 0))
         return false;

     // Otherwise, compare with an absolute tolerance.
     return (Math.Abs(a - b) < EPS);
}

Peter, however, mentions: “Any fixed epsilon would likely be too small when comparing large numbers, or too large when comparing small numbers.” That is indeed true, and I generally get by with passing a separate epsilon value in those situations. The relative version of the method, from Peter’s post, looks like this:

public static bool AlmostEqual(double a, double b)
{
     if (a == b)
         return true;
     else
         return Math.Abs(a - b) < Math.Abs(a) / 281474976710656.0;
}

For an explanation of the constant (it is 2^48), see the original post. As is obvious, this won’t work well with numbers close to zero. There are other schemes, where one calculates (a-b)/a, (a-b)/b, or (a-b)/max(a,b) and compares the result with an epsilon.
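A sketch combining these ideas - an absolute tolerance for the near-zero case and a relative tolerance based on max(|a|, |b|) otherwise - could look like the following. The tolerance values here are just illustrative defaults, not recommendations:

```csharp
using System;

public static class DoubleComparer
{
    // Combines an absolute tolerance (for values near zero) with a
    // relative tolerance scaled by max(|a|, |b|) everywhere else.
    public static bool AreAlmostEqual(double a, double b,
        double relEps = 1e-12, double absEps = 1e-15)
    {
        if (a == b)                      // exact equality, incl. infinities
            return true;

        double diff = Math.Abs(a - b);
        double largest = Math.Max(Math.Abs(a), Math.Abs(b));

        if (largest < absEps)            // both numbers essentially zero
            return diff < absEps;

        return diff < relEps * largest;  // relative comparison otherwise
    }
}
```

With this version, 1E-20 and -1E-20 compare as equal (both fall in the near-zero branch), while large numbers are judged by their relative difference.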

Using Double.Epsilon
Note, however, that using the Double.Epsilon available in .NET as the epsilon value is not a good idea in these comparisons. Double.Epsilon is only useful if one is subtracting values between 0 and 1, as it is the "smallest positive Double value greater than zero". Since double values are not evenly spaced, and the distance between them increases with their magnitude, one will never get a difference anywhere near Double.Epsilon when subtracting two arbitrary values.
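A quick illustration of why Double.Epsilon fails as a tolerance (easy to verify interactively):

```csharp
using System;

public static class EpsilonDemo
{
    public static void Run()
    {
        // Double.Epsilon is ~4.94E-324: it cannot even survive
        // being added to 1.0, the sum rounds straight back to 1.0.
        Console.WriteLine(1.0 + Double.Epsilon == 1.0);   // True

        // A typical rounding error is many orders of magnitude larger,
        // so a Double.Epsilon tolerance would (wrongly) reject it.
        double err = Math.Abs((0.1 + 0.2) - 0.3);         // ~5.55E-17
        Console.WriteLine(err < Double.Epsilon);          // False
    }
}
```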

Other Methods
Recently, I came across another interesting post which shows how to compare two floats (and, by extension, doubles) using integers. If you are interested, definitely check it out. Another suggestion that has been floated in these discussions is to use arbitrary precision arithmetic (APA), but in most applications, doing everything in APA would be really slow. An example of the use of APA is the Microsoft Windows Calculator (the calc.exe program).

End Word
One sentiment that repeatedly appears in posts on this topic is that “the chosen method would really depend on what one would consider equal”. It is definitely application specific, and with enough spread in the test cases, a wrong choice of method will show up as a wrong result (or so we hope !!).

References
Following are two really good references I have found on this topic:
[1] Jon Skeet’s coverage for floats and doubles and decimals
[2] Bruce Dawson’s coverage

Update (March 5th, 2017): Updated dead link.

Thursday, December 1, 2011

Recent developments in OR solutions

Recently, the following new OR solutions have come to my attention. It is good to see such enrichment of the OR solution space.
  • OpenSolver: This is an add-in to Microsoft Excel for solving LPs and MIPs. It is not a new solver per se; rather, it provides an interface similar to the familiar Microsoft Excel Solver (provided by Frontline Systems; a limited version is included with Microsoft Excel for free), and solves LPs and MIPs using COIN-OR's CLP (COIN LP) solver through CBC (COIN Branch and Cut). I came to know about this project from a presentation at the INFORMS Annual Meeting 2010. At this year's INFORMS Annual Meeting (2011), it was judged the winner of the COIN-OR Cup award. Here is the blurb:
Pietro Belotti, the Chair of the Award Committee, delivered the announcement, and this year's award went to Andrew Mason from the University of Auckland and Iain Dunning, currently at MIT Center for Operations Research, for OpenSolver, a linear and integer optimization solver for Microsoft Excel. OpenSolver harnesses the power of the COIN-OR CBC optimization engine and is compatible with Excel 2003, 2007, and 2010. Neither Andrew nor Iain could attend this year's meeting, so the award---a cup full of coins---was accepted by Ted Ralphs on their behalf. In his speech delivered on Andrew's behalf, Ted highlighted the rapid traction OpenSolver gained in the O.R. community, going from 0 downloads 18 months ago to roughly 30 downloads per day currently. He also mentioned the release of the beta version of OpenSolver Studio, which allows users to use Excel to maneuver models built using the Python-based PuLP modeling language, which provides a more elegant solution for modeling complex optimization problems.
I hope to check out the interface sometime.

Sunday, November 20, 2011

Optimization problems in cloud computing

In the just concluded INFORMS Annual Meeting 2011 at Charlotte, NC, I attended an interesting session on a survey of optimization problems in cloud computing. There were actually a bunch of talks on cloud computing, definitely more than what I noticed in the last annual meeting in 2010. That, of course, closely correlates with the importance and popularity of cloud computing, and more vendors entering this space.

The survey was very well presented. One aspect I pointed out to the speaker was the near-absence of stochastic optimization problems in this domain. I think quite a few of the problems mentioned in the talk are actually stochastic in nature, but were presented as deterministic, mostly for simplicity.

I encourage anyone interested in this area of research to get in touch with the presenter for the slides.

Friday, November 18, 2011

Accuracy of solutions in OR solvers

In the INFORMS Annual Meeting 2011 which just concluded (I had a great time, by the way), I attended a talk on improving the accuracy of optimization solvers. It was a well-attended talk, and the method was also interesting. The method was presented in the context of LPs and MIPs. The speaker suggested that, once the initial optimal solution is found, the problem be reformulated as follows (if I have got this right):
  • Scale the problem so that the discrepancy moves closer to the decimal point. E.g., if the discrepancy appears at the 6th decimal place, scale the problem so that it appears at the 1st or 2nd decimal place.
  • Consider only active or near-active constraints (the constraint set can be filtered using the shadow prices of each constraint).
  • Solve the problem using warm-start from the previous solution. Iterate this scheme a few times.
I guess this scheme can be extended to constrained nonlinear programming problems as well (and, I think, the author mentioned as much), but my first instinctive question was completely different. I am not sure why someone would do this, i.e., need that much accuracy in the first place, other than for bragging rights when competing with other solvers. Unless one is solving textbook problems with given data, typical data accuracy does not go beyond the decimal point, and even when it does, it seldom goes beyond 1-2 decimal places. Why solve a problem accurately up to 10 decimal places when the input data is not that accurate?

Now, I am aware of industries where the data are often between 0 and 1, for example chemical processes, but shouldn't we be rescaling these problems before solving, rather than applying the above procedure after solving? Maybe I am missing something here !!

Thursday, November 10, 2011

Development burden on lone consultant

In a recent session of the INFORMS Rocky Mountain chapter, an OR/MS consultant presented on an implementation of a vehicle routing problem using Microsoft Excel, Microsoft MapPoint and Microsoft Solver Foundation. He also had a slide on the key takeaways from the project. I found the slide pretty interesting, and here are my comments on a few points mentioned there.

The Jack-of-all-trades aspect
I am not sure whether this consultant works completely alone or has a small team (like 2-3 people), but the diverse range of things he had to do for the project displays the versatility of this guy (or his small team): data cleaning (and writing VBA macros for that), OR modeling, OR software (he developed and solved part of the problem using Excel and Microsoft Solver Foundation Express Edition, and part of it using a custom Tabu search heuristic developed in C#.NET), report generation (VBA macros), etc. A large part of OR consultancy requires proficiency in various non-OR programming languages.

The OR/MS aspect of the project
The first few points on the slide were about comparing the time that was required to develop the complete solution (let's call this X - we love algebraic modeling, don't we?) with the amount of time that actually went into OR stuff (let's call this Y). I was not surprised when he noted that:
X - Y ≥ 3*Y => X ≥ 4*Y
I have heard this premise before. At the INFORMS Annual Meeting 2010, I attended a panel discussion on "OR/MS Consulting - The Challenges". The panelists were all consultants, and a few were from 2-3 person shops. Interestingly, the presenter of the talk I mentioned at the beginning of this post was among the panelists. This panel discussion, by the way, is one of my favorite sessions ever attended at an INFORMS annual conference - I wish somebody had recorded a video of it. The consensus seemed to be that if one is the OR modeler in a small shop, the OR work is often less than 15% of the total development work. If one is part of a slightly larger organization, one would be lucky to have it near the 25% mark.

Thoughts for would-be OR/MS consultants.

Update [12/22/2016]: Label consolidation.

Ateji in deadpool: interesting entrepreneurial feedback

Recently I came to know from an entry in Erwin's blog that Ateji has ceased to exist. I was aware of this company and its technology, but I would not say that I was a close follower, mainly because I work in the .NET domain. I also liked the OptimJ GUI. Regarding the technology, as I understand it, Ateji introduced a language-based extension to Java to express optimization problems (objective functions, constraints, etc.) more succinctly. They also had language extensions (and library support, again in Java) for easily implementing parallelism. This can perhaps be compared to the availability of the TPL (Task Parallel Library) in .NET (starting from 4.0). Please correct me if my understanding here is flawed.

One thing of mutual importance was (and is) the multi-core execution of optimization solvers, or injecting parallelism easily into mathematical optimization problems. From that perspective, the closing post from Ateji is interesting. Here is a verbatim excerpt from that post on the adoption issues the company (and the technology) faced:

Adoption issues
We did anticipate that introducing a new language-based technology and getting it adopted would take a long time. The truth is, it takes much longer than anticipated. Many “new” languages that are coming into fashion now were first introduced 10 or 20 years ago.
There are however examples of companies that successfully built a business on a language. Think of Oracle and SQL. They had a strong business case: suddenly, all software developers became able to access databases in a simple and intuitive way. We thought we had a similar strong business case with Ateji PX and parallel programming: suddenly, all Java developers became able to write parallel code for multicore processors in a simple and intuitive way.
Our largest mistake has probably been to surf on the popularity of the Java language and on the hype of the “multicore crisis” (all software has to be adapted for the coming generation of new hardware) without developing a fine enough understanding of the market.  We discovered the hard way that Java developers just don’t want to hear about parallel programming, regardless of how easy we made it. They’d rather use only one core out of their expensive 16-core server than write parallel code (we’ve seen the case). They long for solutions that hide parallelism. This may change when 1000′s of cores become the norm, future will tell.
At the other end of the spectrum, developers that do parallel programming just don’t want to hear about Java, because 10 years ago it used to be slower than C++.

I guess the key takeaway here is that although people (more specifically, developers who consume APIs) understand and appreciate the value of multi-core execution, it is not mainstream yet, and there is still a large inhibition against actually using it. I am reasonably sure that things will change, but it will take time. We are already seeing a proliferation of languages supporting parallelism constructs out of the box, but that still puts the onus on the developers to use them. In the meantime, the best way to go, if one is in the business of developing computing libraries, is to incorporate multi-core execution in the library itself and make it completely transparent to the users. In other words, the only way to achieve developer adoption is to have a no-brainer, drop-in type solution. As Erwin puts it in his post, the target audience for Ateji's technology was too niche to support the company.

Best wishes to the talent pool for their future endeavors.

Saturday, November 5, 2011

Sensitivity chart does not converge in Crystal Ball

Recently I received a support request about a sensitivity chart issue, where the top few assumptions contributing to the variance of the target forecast seemed to change every time the simulation was reset and run (with no seed set). In general, even if the seed is not set, one should expect the list of top contributing assumptions to converge when enough trials are run. Even if enough trials are not run, the list should not vary a lot with each reset.

The model was simple. There were around ten assumptions, linked to one forecast through a series of formulas. The flowchart below (Figure 1) shows the type of formulas that were being used.
Figure 1: Model Flowchart

While running the model, two things were noticed:
  1. As mentioned before, the sensitivity charts do not stabilize even with a large number of trials (there is a change in the contribution-to-variance values, or in the position of a variable on the chart, or both).
  2. Switching to the Sensitivity data view, it was noticed that the rank correlation values are at or near zero (considering up to 2 decimal places). That would indicate spurious correlations, but in this model, that should not be the case.
These are signs that the sensitivity chart may not be working as intended, which prompted me to check the limitations of sensitivity charts (also documented in the User's Guide installed with CB; available from Excel at: Excel 2007 > Crystal Ball tab > Resources > Crystal Ball Documentation, and then Analyzing Other Charts > Understanding and Using Sensitivity Charts > Limitations of Sensitivity Charts). This turned out to be a classic case of one of the limitations of our sensitivity analysis: the analysis does not support non-monotonic relationships between the assumptions and the target forecast. Looking at the formulas in the model, the forecast is an aggregate of functions that apply absolute values to the assumptions. That makes the relationships between the assumptions and the forecast non-monotonic, and hence the problem.

To ascertain that this is indeed the problem, i.e., that the relationships between the assumptions and the target forecast are non-monotonic, we can run the Tornado chart tool on the model and look at the spider chart. This chart can help identify which variables have a non-monotonic relationship with the target forecast. Looking at the spider chart (Figure 2), we see that this is indeed the case.
Figure 2: Spider Chart
The chart shows that each line is broken in the middle, which signifies non-monotonicity. If a relationship were monotonic, the curve for that variable would have moved steadily in one direction from the left end to the right end.

Update (11/7/2011): Language.
Update (12/8/2011): Language.

Wednesday, November 2, 2011

Saving partially completed OptQuest model runs in Crystal Ball

There are times when, working with a large OptQuest (a.k.a. Decision Optimizer) model that has been running for hours with no end in sight, you want to take a break and do something else with the computer. But you do not want to lose the progress that has been achieved by the solver so far - so you need a way to save the partially completed run. Although there is no direct way to achieve this through a button click, it can be done. Here we discuss two ways to do it, depending on how long a break you (and your computer) need.

Short break
It is possible to informally save a partially completed OptQuest run. You can hit the red 'Stop' button on the Crystal Ball control panel, which essentially pauses the solver. When you are ready, you can hit the green 'Continue Optimization' button, and the optimization will continue from where it stopped, with all the settings and progress intact.
Notes:
1. This technique works for general optimization problems, but not if you are trying to evaluate the efficient frontier.

Long break
It is also possible to completely shut down Excel and Crystal Ball with a partially completed model. In this case, after you hit the red 'Stop' button on the Crystal Ball control panel, export the intermediate solution to the spreadsheet using OptQuest Results window -> Edit -> Copy Best Solution to Spreadsheet. Then you can save and close the file. When you open the file again and run OptQuest, it will use the saved solution as the starting solution (the base values of the variables), improve on it, and should still end up with the same solution.
Notes:
1. Notice the word should in the last line. In this approach, OptQuest loses its history of previously tried solutions, and might end up re-examining some of the solution space.
2. Hence, the solution path and the time taken might differ from the original solution path had you not disrupted the solver. The resulting solutions in the two cases (undisturbed vs. restarted) should not be considerably different, though.

Enjoy your break !!

Note: This post is based on a recent CBUG discussion on the same topic.

Thursday, October 27, 2011

Event modeling in depth in CB Predictor (Part 1)

Event modeling is a new feature in the most recent release of Crystal Ball (11.1.2.1). In the last post, we talked about the basics of event modeling using CB Predictor and introduced the UI for this feature. In this and the following post, we will dive deeper into certain aspects of event modeling in CB Predictor. In this post, we will touch upon a few limitations of the CB event modeling feature, and conclude with a discussion of the relationship between data screening and event modeling.

Limitations of CB Predictor event modeling
A few limitations of CB Predictor which you may or may not notice while modeling events are:
  • This was mentioned in the last post: we need at least one occurrence of an event in the historical period in order to use that event in the future.
    • Workaround: If you have an event for which you know, or have already calculated, the effect on the time series, you can directly add the effect of that event to the future periods. Unfortunately, this has to be a manual process; CB Predictor cannot be used for it.
  • CB Predictor does not allow defining overlapping events in the historical period or in the future. That is, there cannot be any day common to two events. It is difficult to calculate the effect of each event when two events overlap, hence the limitation.
    • Workaround: Similar to the previous point, you have to estimate and remove the effect of one of the events from the time series. For example, if events A and B both happen on a specific day, and you can estimate the effect of either A or B, then you can let Predictor handle only A (or B) by subtracting the effect of B (or A) from the series, and then adding back the effect of B (or A) after the forecasting is done.
  • CB Predictor issues a warning if more than 10% of the historical values are defined as events. Since CB Predictor calculates the effect of an event using the same algorithms as for imputing missing values, defining too many events could affect the predictive accuracy.
  • If you are using multiple linear regression in CB Predictor and an event is defined on all the series, then, as you would expect, the effects of these events are not explicitly added to the dependent variable(s); rather, the effects are expected to be captured through the regression equation. On the other hand, when an event is defined only on some of the series designated as independent variables, then we should explicitly account for events defined only on the dependent variable. Currently that is not done. Please let us know if you face this situation often, and what you think about the suggested approach.
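The subtract-and-add-back workaround for overlapping events can be sketched as follows. Here the forecasting delegate stands in for a Predictor run with only event A defined - it is a hypothetical placeholder, not an actual CB Predictor API, and the event effect is assumed to be additive and already estimated:

```csharp
using System;

public static class OverlapWorkaround
{
    // Removes the (separately estimated) additive effect of event B from
    // the history, lets the forecasting engine handle event A alone, then
    // adds B's effect back to any future period where B is expected.
    public static double[] ForecastAroundOverlap(
        double[] history, int[] eventBDays, double eventBEffect,
        int[] futureEventBDays, Func<double[], double[]> forecastWithEventA)
    {
        double[] adjusted = (double[])history.Clone();
        foreach (int day in eventBDays)
            adjusted[day] -= eventBEffect;     // strip B's effect from history

        double[] forecast = forecastWithEventA(adjusted);

        foreach (int day in futureEventBDays)
            forecast[day] += eventBEffect;     // restore B's effect in future
        return forecast;
    }
}
```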
Event modeling and data screening
Event modeling and data screening are both part of the ‘Data preparation’ phase in CB Predictor, and both are available in the ‘Data Attributes’ panel. We take a similar approach to both: once an outlier has been identified, it is treated as a missing value, and an approximate value is then imputed using one of the available algorithms. Here are some notes which will be helpful if you are planning to use both features together:
  • The purpose of both features is the same: to find and explain outliers.
  • One would use data screening to detect outliers which are unusual values without a known cause. On the other hand, one would use the events feature of Predictor to define identifiable occurrences that have affected historical data and could affect predicted data.
  • CB Predictor takes events into account before screening data for outliers. That means, if you define an event, the effect of that event is first factored out; the time-series models are then run and the outliers detected as per the user-set rules.
  • In general, that means that, if an event has been modeled correctly, an outlier should not appear on the same date(s) as the event. If you notice otherwise, please let us know.

Summary
Event modeling is a powerful new tool in CB Predictor, and some extra care is needed when using this feature.

Update (10/31/2011): Miscellaneous corrections.

Saturday, October 1, 2011

Presentation on introductory OR Problems in Analytics

I was recently invited by Prof. Sriram Sankaranarayanan of the Computer Science department at the University of Colorado Boulder to give a guest lecture in his linear programming class. Sriram is an old friend from my undergraduate years at IIT Kharagpur, and had a stellar academic record there (as well as in his graduate years).

Anyway, since this was going to be a one-off presentation, I decided to touch on a few different things. I started off with a few examples of optimization that we face in our daily life and do not often realize, including route planning, versions of the traveling salesman problem, revenue optimization, etc. I followed these up with a simple example of modeling an optimization problem where the constraints need to be figured out from given, somewhat unstructured data - an introduction to a so-called Analytics problem. After a few follow-up slides on the different types of computational problems in Analytics, I wrapped up with career options in OR/Analytics and a few online resources for optimization.

If you are interested in the presentation material, check it out.

Update (10/3/2011): Updated title.

CB 11.1.2.1 features: event modeling in CB Predictor

As mentioned in the announcement post for the new release, another new feature in Predictor in this release is the ability to model events in your historical data. Event modeling is an important aspect of time-series modeling and analysis since, if events are not accounted for, the resulting forecasts can be way off. Events can be either planned or unplanned, happening irregularly or at regular intervals. Examples include machine breakdowns in a production environment, planned or unplanned maintenance, workers' strikes, etc.

This post is in two parts. In this first part, we discuss our implementation of events in the Crystal Ball PS1 release. In the second part, we will analyze some results regarding best practices to adopt when modeling time series with embedded events.

Why perform event modeling?
As mentioned above, if we do not account for events, usual or unusual, in the historical data, we face a few problems:
  • Events distort the actual pattern in the data, by either emphasizing or attenuating signals.
  • The model gets fitted to the distorted data. In the case of CB Predictor, this may result in the wrong model being selected as the best model (since we just look at the user-selected error measure when deciding the best model), or in the parameter values being incorrectly calculated for the (eventually) right model.
  • The above two issues lead to noisy forecasts.
This is precisely why event modeling is often considered one of the “data cleansing” steps. We definitely regard it as an important step in preparing data for time-series forecasting, which is why we have included it in the “Data Attributes” section, along with other data cleansing steps like filling in missing values and outlier detection.

GUI Interface
As mentioned above, the entry point to event modeling is in the “Data Attributes” panel, as shown in Figure 1.

Figure 1: Access to events
Once the ‘View Events ...’ button is clicked, we get a dedicated window for event modeling, shown in Figure 2. From this window, we can add or update an event, using the dialog shown in Figure 3, or delete an event.

Figure 2: Events GUI

Figure 3: Add/Edit an event
Event Types
CB Predictor supports different types of events in this framework. An event could have happened once or multiple times in the historical date range. The same event can be predicted to happen in future forecast periods as well, resulting in an uplift (or downgrade) of the original forecast value. Common-sense restrictions apply: an event has to have at least one occurrence within the historical date range in order to be used in future forecasts (otherwise, its effect cannot be calculated). An event occurring multiple times, either in the past or in the future, can be defined at regular or custom intervals. An event can also span multiple consecutive time periods in the historical data or in the future forecasts. For more information about defining different types of events, please refer to the Predictor User’s Guide (Setting Up Predictor Forecasts > Selecting Data Attributes - Seasonality, Events, Screening > Viewing and Managing Events).

Algorithmic Details
We use a simple algorithm for analyzing events in the historical date range. For each occurrence of an event, we calculate the effect of the event using one of the algorithms for imputing missing historical data values. If there is a single occurrence of an event in the past, the same effect (either uplift or downgrade) is used for future occurrences of the event. If there are multiple occurrences, we extrapolate the effects using a line fit to calculate the effects of the same event in the future.
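The line-fit extrapolation step can be sketched as a simple least-squares fit over the observed effects. This is my own illustration of the idea, not the actual Predictor code:

```csharp
using System;

public static class EventEffectExtrapolator
{
    // effects[i] is the estimated effect (uplift or downgrade) of the i-th
    // historical occurrence of the event. Returns the predicted effect for
    // occurrence number 'futureIndex' (0-based, continuing the sequence).
    public static double PredictEffect(double[] effects, int futureIndex)
    {
        int n = effects.Length;
        if (n == 1)
            return effects[0];        // single occurrence: reuse its effect

        // Least-squares fit of: effect = intercept + slope * occurrenceIndex
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++)
        {
            sumX += i; sumY += effects[i];
            sumXY += i * effects[i]; sumXX += (double)i * i;
        }
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;

        return intercept + slope * futureIndex;
    }
}
```

For example, if the three past occurrences of an event had effects 2, 4, and 6, the fitted line predicts an effect of 8 for the fourth occurrence.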

Conclusion
Check out the new event modeling capabilities in CB Predictor and let us know how it works for you.

Friday, September 30, 2011

Showing more precision in CB Predictor result window

Have you ever wondered how you can show more digits in the CB Predictor result window? The other day, I was checking the difference between the error measures of two forecasting models, and the default view of 2 decimal places was not adequate for the investigation.

How to do it?
Unfortunately, we do not have a preference that controls this while you are viewing the result window. The only way is to increase the precision of the input dataset in Excel: format the complete range of the dataset as numbers, with the specific number of digits you want to see after the decimal point in the result window. As in Crystal Ball, CB Predictor maintains this formatting.

Excel 2007 ribbon showing the buttons to increase or decrease decimal

Dialog for formatting Excel cells


A Gotcha
Note the phrase "complete range" that needs to be formatted in Excel for this to work. If there are pre-data gaps in your series, do not ignore those empty cells while formatting.