Thursday, April 14, 2011

Entering a Large Number of Constraints in OptQuest

There was recently a discussion in our Yahoo Crystal Ball User Group (CBUG) about the ways one can enter constraints in OptQuest, specifically when there are very many of them. I wanted to clarify some of the options available for doing just that in the current releases (CB 11.1.2.X).

As I indicated in the group post, there is no neat way to do that, but my follow-up research turned up a few ways to achieve the same result. Let’s discuss the options.

Option 1: Constraint convolution
The constraints can be convoluted together into a single super-constraint, which is satisfied only if all the individual constraints are satisfied. Dave Hammal, our resident Solution Specialist, describes this approach while answering a similar question in our LinkedIn group. Basically, one creates an Excel formula for each constraint that reports 1 if the constraint is satisfied and 0 otherwise, and then builds the super-constraint using an aggregate function, e.g., minimum, that indicates whether all the constraints are satisfied. This final constraint is the one added to OptQuest.
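As a concrete sketch (the cell addresses and layout here are hypothetical, for illustration only), suppose there are 100 constraints, each requiring the value in column A to stay at or below the value in column B:

```
D1:  =IF(A1<=B1, 1, 0)    1 if constraint 1 is satisfied; fill down to D100
E1:  =MIN(D1:D100)        1 only if all 100 constraints are satisfied
```

One would then add a single OptQuest constraint requiring the aggregate cell to equal 1 (e.g., E1 = 1).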

Note that although this option works, OptQuest will still try out a substantial number of infeasible solutions before closing in on valid ones. The method effectively distorts the solution space: OptQuest does not get to use information from the individual constraints and only sees the final aggregate constraint. This in turn can increase the time OptQuest takes to solve a problem. Even so, I think this is an easy option to try before moving to the second option described below.

Option 2: Using OptQuest DevKit

This is the best way to programmatically add numerous constraints to a model. OptQuest is also exposed as a COM DevKit and can be used from VBA. You can find the details of the DevKit methods in the Windows help file “CB COM DevKit.chm”, which is available in your installation folder under Docs (something like: Program Files\Oracle\Crystal Ball\Docs). Check out the OptSetup.AddConstraint method, which takes a string equation written just as one would enter it in the constraint panel. Once the VBA code has executed, the constraints are available in the model and can also be accessed through the GUI. Note: you do not need to update or install anything to access the Developer Kit; this component is installed with every version of Crystal Ball.
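Here is a minimal VBA sketch of the idea, not production code: how the OptSetup object is created and attached to the model is documented in the .chm file, so the sketch simply assumes an already-initialized object, and the constraint strings and cell references are hypothetical.

```vba
' Sketch only: assumes optSetup is an OptQuest setup object obtained as
' described in CB COM DevKit.chm; the cell references are hypothetical.
Sub AddManyConstraints(optSetup As Object)
    Dim i As Long
    For i = 1 To 200
        ' Each constraint is passed as a string, written exactly as it
        ' would be typed into the OptQuest constraint panel.
        optSetup.AddConstraint "C" & i & " + D" & i & " <= 100"
    Next i
End Sub
```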

This option generally works better than the first option above and can sometimes speed up solver execution substantially.

Option 3: Remodeling
It can also help to remodel the problem to decrease the number of constraints, if possible. Case in point: if you are entering constraints at an individual (e.g., item) level, it might help to model a constraint at a group level instead. That would not only decrease the number of constraints but also loosen tight constraints in the solution space, making it easier to find a solution. Of course, this approach may not always be feasible (e.g., you might already be modeling at an aggregate level), but at a minimum it is worth a thought; a schematic example follows.
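For instance (a schematic example, not from the original discussion): instead of one capacity constraint per item,
$x_i \le c_i, \quad i = 1, \dots, n$
one might impose a single group-level constraint
$\sum_{i=1}^{n} x_i \le C$
where $C$ is the total capacity available to the group. One constraint replaces $n$ of them, and the aggregate constraint is looser than the item-level ones.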

General Note
A general note in this context: while solving simulation optimization problems, we have seen the performance of the OptQuest solver engine deteriorate somewhat beyond roughly 300 decision variables. We would be interested in hearing how many constraints you are using in your optimization problems, and what solver performance you are obtaining.

Finally, as mentioned in the forum post, we would like to provide a way to enter such constraints directly from the GUI, as in Excel Solver, in a future release.

Update (5/16/2011): Added more notes on the usage.
Update (3/18/2013): Added more notes and a third option.

Thursday, April 7, 2011

Triangular distribution: the mean has a range

I was trying to decide on a distribution for a few small datasets (18 points at most). Specifically, I thought the triangular distribution would be a good choice: we had only a limited amount of data, the datasets were always skewed, mostly to the left, and the triangular distribution is general enough to capture the full range of risk in the data. But rather than fitting the data directly to a triangular distribution, I decided to use a parametrized fit, in which I would calculate the triangular parameters from the minimum, maximum, and average of the data. The rationale was that, since I did not have enough data points, a direct fit would not be great anyway, and since I was set on using a triangular distribution, a parametric fit would work better.

Given the minimum, maximum, and average of a dataset, finding the triangular parameters minimum ($a$), most likely ($b$), and maximum ($c$) is easy. The average (mean) of the distribution is given by:
$\mu = \frac{a + b + c}{3}$
So, given $a$, $c$, and $\mu$, we can calculate $b$:
$b = 3\mu - a - c$

While applying the formula to the statistics from a few datasets, it soon became apparent that the triangular distribution definition does not work out for some of them. For example,
$\min=0,\ \max=105,\ \textrm{average}=37.4 \Rightarrow \textrm{most likely}=7.2$
but,
$\min=5,\ \max=175,\ \textrm{average}=54.3 \Rightarrow \textrm{most likely}=-17.1$
The most likely value from the last example cannot be used, since it is lower than the minimum. In fact, while checking the math, I realized that a triangular distribution cannot support arbitrary average values: just because the average lies between the minimum and the maximum does not mean that a triangular distribution with that minimum and maximum can produce it. Since the most likely value $b$ must satisfy $a \le b \le c$, the mean $\mu = \frac{a+b+c}{3}$ is smallest when $b = a$ and largest when $b = c$. The range of averages that a triangular distribution with minimum $a$ and maximum $c$ can support is therefore:
Range = $[\frac{2a+c}{3}, \frac{a+2c}{3}]$.
If the average value from the data is not within this range, it cannot be used to define a triangular distribution.
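As a quick check on the second dataset above: with $a=5$ and $c=175$, the supported range is $[\frac{2 \cdot 5 + 175}{3}, \frac{5 + 2 \cdot 175}{3}] \approx [61.67, 118.33]$, and the observed average of 54.3 falls below it, which is exactly why the computed most likely value came out negative.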

That made me realize that my initial assumption of using a triangular distribution for these datasets was itself wrong. Further research suggested that the best distributions for these datasets are beta, Weibull, or lognormal; indeed, triangular was towards the end of the list, sometimes even below the normal distribution.

Note: I have started using MathJax to render mathematical formulas in blog posts. If you see strange markup in a post, either you are using a dedicated feed reader that has trouble rendering these formulas, or the browser page needs to be refreshed. If you are using a feed reader, please visit the post’s website to view it in a browser.

Sunday, April 3, 2011

ICC World Cup 2011: Statistical Analysis of Cricket Matches

India just won the ICC Cricket World Cup 2011 today, beating Sri Lanka by 6 wickets. They had to chase 275 and reached the target with 10 balls to spare. It was a really good match all along, with both teams playing their best.

Being in the United States, where all the rage in sports statistics seems to be about baseball (although I will grant that the entries in the INFORMS Blog Challenge on OR and Sports are quite diverse, with one entry on cricket as well), I do not hear much about statistics and Monte Carlo simulation in relation to cricket. Just like baseball, cricket is ripe for an infusion of heavy statistics, and in fact, it might already be there. There are plenty of statistics available to whet anybody's analytic appetite; see, for example, the Statsguru section of the ESPN Cricinfo website, where you can create customizable queries to get any statistics of your choice. Anyway, I decided to take a look at the literature on this topic and was happy to find quite a few interesting papers. Here are a few that I found interesting:

[1] M. J. Bailey & S. R. Clarke. Predicting the match outcome in one day international cricket matches, while the match is in progress. Journal of Sports Science and Medicine. 2006. 5, p480–487.
This paper uses multiple linear regression analysis to try to predict various outcomes of a match with the data that is already available.

[2] T. B. Swartz, P. S. Gill & S. Muthukumarana. Modelling and simulation for one-day cricket. The Canadian Journal of Statistics. 2009. 37(2), p143–160.
This paper goes pretty close in simulating ODI matches, to answer various questions, like how many runs will be scored etc.

[3] B. M. de Silva, G. R. Pond & T. B. Swartz. Estimation of the magnitude of victory in one-day cricket. The Australian and New Zealand Journal of Statistics. 2001. 43. p259–268.

Naturally, most of these articles (including all of the above) have appeared in journals dealing with statistics. The other avenue for related research is the Operations Research journals; quite a few articles have appeared there on modeling batting orders, bowling orders, and so on. One of the most famous cricket-related articles to appear in an OR journal is the one on the Duckworth-Lewis method, which is already often used in practice to determine the winning team (or reset the batting target) when there are unexpected interruptions in a match, e.g., rain. The reference is as follows:

F.C. Duckworth & A. J. Lewis. A fair method for resetting targets in one-day cricket matches. Journal of the Operational Research Society. 1998. 49, p220–227.

Could we have simulated the result of the World Cup final played today? One thing is for sure: even if a simulation had produced the correct result, it would not have reduced the fun of watching the match one bit.

Update (11/16/2011): I have a few ideas about modeling cricket matches using Monte Carlo simulation and would like to collaborate on a research paper. If anyone reading this is interested, please get in touch through the comments.

Saturday, April 2, 2011

Focusing on broader predictive analytics

As you may have noticed, I have changed the title and description of the blog to emphasize its broader focus on various topics in predictive analytics and OR. I will still post about Crystal Ball modeling approaches, but I have decided to widen the coverage to include software-independent modeling techniques, industry news, and various notes.

Hope you will continue to enjoy the posts and find them useful.