Saturday, December 25, 2010

Using spreadsheet software like Crystal Ball for Monte Carlo simulation

One of the important aspects of risk analysis is selecting appropriate software. My tutorial paper introducing Monte Carlo simulation ([1]) has a section titled Monte Carlo Simulation Software, which lists the ways Monte Carlo simulation can be implemented and used in practice. From a practitioner's perspective, it is important to use the software that will be most helpful in analyzing and making decisions about the problem at hand.

Let us start off with a quick summary of this section from the paper. There are typically four options to choose from when performing risk analysis using Monte Carlo simulation:
a. High-level programming language like C/C++/Java/C#
b. General purpose simulation and modeling environment, like Matlab, R, Scilab, GNU Octave
c. Stand-alone simulation modeling software
d. Software add-in to spreadsheets like Oracle Crystal Ball

In this post, let us review some of the questions that one can ask when choosing the software to use. We will also comment on when spreadsheet-based software like Oracle Crystal Ball would be an appropriate choice.

1. Deterministic modeling: One of the first steps in performing risk analysis is developing a deterministic model of the situation. In this context, one can ask whether the deterministic model can be developed in a spreadsheet environment. Some models, such as financial models, are typically convenient to build in a spreadsheet. On the other hand, physical systems are difficult to model in a spreadsheet, since you would need other computationally intensive modules. Here is where the skill set of the analyst comes in: if you are already familiar with Oracle Crystal Ball and would like to leverage that knowledge rather than learn new software, there are ways around this problem. One can use the macro hookup facility (macros that run before and after each trial) to run arbitrary code. If the modules were developed to run on *NIX platforms, one can use Cygwin to compile them for the Windows platform, and then call them from the macros. There are a few other possibilities for other situations (like Matlab code), but this is the general idea.

2. Access to historical data: Where does the data that you want to use in the modeling reside? If it sits in database servers, one can go either way between Crystal Ball and other software. On the other hand, if it is already in CSV format, or better yet, in spreadsheets like MS Excel, choosing CB is easy.

3. Modeling uncertainty: Do you have historical data that you want to use to model uncertainty? In CB, you can use historical data either to fit probability distributions or to forecast future values, which can then be used as assumptions in your model. For a particular modeling exercise, you might want to do both: for some variables you would perform distribution fitting, while for a few others you need time series forecasting. Crystal Ball is useful in this scenario, since you can do both with the same software. Even if you want to build distributions from expert opinion, CB can still be used (along with appropriate distributions like triangular or betaPERT). For a feel of what the fitting step involves outside of a spreadsheet, see the sketch below.
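For readers who like to prototype outside a spreadsheet, here is a minimal sketch of the distribution-fitting step using Python and scipy.stats (the demand numbers are made up for illustration); inside CB, the equivalent is the Fit Distribution dialog.

    import numpy as np
    from scipy import stats

    # Hypothetical historical data: observed monthly demand values.
    demand = np.array([112.0, 98.5, 130.2, 121.7, 105.3, 118.9, 127.4, 101.1])

    # Fit a classic two-parameter lognormal (location fixed at 0).
    shape, loc, scale = stats.lognorm.fit(demand, floc=0)

    # A simple goodness-of-fit check: Kolmogorov-Smirnov statistic and p-value.
    ks_stat, p_value = stats.kstest(demand, 'lognorm', args=(shape, loc, scale))
    print(shape, scale, ks_stat, p_value)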

4. Distribution types: Does your model involve non-normal distributions? If your model involves only normal distributions for the uncertainties, it is easy to implement in a regular programming language, or even in an environment like Matlab, with a few lines of code. On the other hand, if your model involves non-normal distributions along with a correlation structure between them, Crystal Ball makes it much easier to model the situation; the sketch below shows how much more machinery the same task needs in a general-purpose language.
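As a point of comparison, here is a minimal sketch of generating two correlated, non-normal assumptions in Python. It assumes a simple Gaussian-copula approach, which is one common way to induce correlation between arbitrary marginals; CB's own correlation feature may differ in its details.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_trials = 10_000
    target_corr = 0.7  # desired correlation between the two assumptions

    # Step 1: draw correlated standard normal variates.
    cov = np.array([[1.0, target_corr], [target_corr, 1.0]])
    z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n_trials)

    # Step 2: push each margin through the normal CDF (Gaussian copula)
    # and then through the inverse CDF of the desired distribution.
    u = stats.norm.cdf(z)
    cost = stats.lognorm.ppf(u[:, 0], s=0.5, scale=100.0)          # lognormal cost
    duration = stats.triang.ppf(u[:, 1], c=0.3, loc=10, scale=20)  # triangular duration

    print(np.corrcoef(cost, duration)[0, 1])  # roughly 0.7, not exactly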

5. Optimizing design parameters: Does your model involve optimizing design parameters? That would mean using a tool for simulation optimization, which can be done easily in Crystal Ball. Typically, in design problems, different design objectives need to be balanced, which can be posed as an optimization problem. If you choose other software, you might have to switch tools for this part of the analysis.

6. Access to charts and reports: Finally, any professional analysis ends with reporting, which needs access to quality charts and reports. Crystal Ball has built-in facilities for generating charts and reports, whereas with other general-purpose software you might have to generate these manually. Depending on the volume and frequency of the analysis, this can consume a considerable amount of time if done manually.

References:
[1] Raychaudhuri, S. Introduction to Monte Carlo Simulation. In Proceedings of the 2008 Winter Simulation Conference, Miami, FL, December 2008. Visit this link for access to the paper.

Friday, November 5, 2010

Oracle Crystal Ball at INFORMS 2010

Are any of the readers attending the annual Operations Research conference (the INFORMS Annual Meeting) being held in Austin, TX, USA, from Nov 7th to Nov 10th? We will be there with a booth in the vendor area. If you are attending, you are welcome to stop by to say hello, or to bring questions (analytical or otherwise) to discuss at the conference. Our academic representative, Michael, and I will be at the conference.
We have two presentations at the conference as well:

1. Title: Oracle - Crystal Ball
    Session Code/Venue: TD48 - Center, Room 9, Level 2
    Time: 5:15pm-6:00pm

2. Title: Time Series Forecasting in Oracle Crystal Ball
    Session Code/Venue: WA31 - Center, Room 5C, Level 3
    Time: 8:00am - 9:30am (2nd talk in the session)

There are a few other presentations that showcase the capabilities of Crystal Ball as well:

3. Title: New Developments for solving Real World Optimization Problems by Marrying Simulation and Optimization (Tutorial)
    Session Code/Venue: WA06 - Center, Ballroom E, Level 4
    Time: 8:00am - 9:30am

Thursday, October 14, 2010

The extended family of CB Assumptions (Part 2)

This is the second part of a series of posts on how to model new (and possibly exotic) distributions in Crystal Ball. The other parts can be found here:
Part 1: The extended family of CB Assumptions

In the last post, we discussed how to model two distributions: Erlang and two-parameter lognormal. In this post, we continue with a few more distributions that can be simulated in CB by modifying one or more parameters of an existing distribution. Data can also be fitted to these distributions by locking one or more parameters to specific values. For a detailed discussion on locking parameters while fitting distributions, check out the coverage of the two-parameter lognormal distribution in the previous post, and our help documents.

Example 3: The Maxwell distribution or Maxwell-Boltzman distribution
This distribution is used in statistical physics, specifically as the distribution of molecular speeds in thermal equilibrium. Mathematically, it is closely related to the gamma distribution.
  • Notes and formulas: Check out the Wikipedia entry and the MathWorld website
  • Quick summary: This distribution is closely related to the gamma distribution. If X follows a Maxwell-Boltzmann distribution with parameter 'a', then X² follows a gamma distribution with location = 0, scale = 2a² and shape = 3/2.
  • Generate random numbers: Define a gamma distribution assumption with location = 0, scale = 2a² and shape = 3/2, where 'a' is the parameter of the distribution, and set up a forecast cell containing the square root of the assumption cell (the same trick we use for the Pearson V distribution below). The forecast values then follow the Maxwell-Boltzmann distribution. A quick numerical check of this relationship appears after this list.
  • Fit to this distribution: To fit a dataset to this distribution, first square the data values, then fit the squared values to a gamma distribution with the location and shape locked to 0 and 3/2 respectively. The parameter 'a' can be recovered from the fitted scale: if the fitted scale is 's', then a = sqrt(s/2).
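Here is the promised numerical check (using Python/scipy purely for verification; in CB the same thing is done with a gamma assumption and a square-root forecast formula):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a = 2.0          # Maxwell-Boltzmann parameter
    n = 100_000

    # Gamma with shape 3/2 and scale 2*a^2 is the distribution of X^2,
    # so taking the square root gives Maxwell-Boltzmann variates.
    x = np.sqrt(rng.gamma(shape=1.5, scale=2 * a**2, size=n))

    # Compare against scipy's reference implementation.
    print(x.mean(), stats.maxwell.mean(scale=a))   # both ~ 2*a*sqrt(2/pi)
    print(x.var(), stats.maxwell.var(scale=a))     # both ~ a^2*(3 - 8/pi)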

Example 4: The Chi-square distribution
This distribution is commonly used in statistical inference. One of the common goodness-of-fit (GOF) statistics used in distribution fitting is the Chi-squared statistic, which, of course, follows the Chi-square distribution. To model this distribution in CB, we use the same technique that we used for the two-parameter lognormal distribution.
  • Notes and formulas: Check out the Wikipedia entry and the NIST website
  • Quick summary: This distribution is a special form of the gamma distribution. A Chi-square distribution with 'd' degrees of freedom can be modeled by a gamma distribution with location = 0, scale = 2 and shape = d/2. We use this relationship in CB to construct the specific Chi-square distribution and to calculate the p-values of the Chi-squared GOF test reported in the distribution fitting results window. (A quick numerical check of the identity appears after this list.)
  • Generate random numbers: Define a gamma distribution assumption with location = 0, scale = 2 and shape = d/2, where 'd' is the degrees of freedom of the Chi-square distribution.
  • Fit to this distribution: Fitting to a Chi-square distribution is slightly tricky, since we do not have the ability to lock the scale of a gamma distribution in distribution fitting. We will come back to this in a future post in this series.
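As a quick sanity check of the gamma parameterization above (again in Python/scipy, outside of CB):

    import numpy as np
    from scipy import stats

    d = 7  # degrees of freedom
    x = np.linspace(0.1, 30, 5)

    # Gamma with shape d/2 and scale 2 has the same density as Chi-square with d df.
    print(stats.gamma.pdf(x, a=d / 2, scale=2))
    print(stats.chi2.pdf(x, df=d))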
Example 5: The Rayleigh distribution
Often used in the physical sciences, this distribution is a special case of the Weibull distribution.
  • Notes and formulas: Check out the Wikipedia entry
  • Quick summary: This distribution is a special form of the Weibull distribution with shape = 2: a Rayleigh distribution with parameter σ corresponds to a Weibull distribution with location = 0, shape = 2 and scale = σ·sqrt(2). (A quick check appears after this list.)
  • Generate random numbers: Define a Weibull distribution assumption with location = 0, shape = 2 and scale = σ·sqrt(2).
  • Fit to this distribution: Fitting to a Rayleigh distribution is also easy: just lock the shape value of the Weibull distribution to 2 (and the location to 0) while fitting. The Rayleigh parameter is then the fitted scale divided by sqrt(2).
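A quick check of this Weibull correspondence (Python/scipy, for illustration only):

    import numpy as np
    from scipy import stats

    sigma = 1.5
    x = np.linspace(0.1, 6, 5)

    # Weibull with shape 2 and scale sigma*sqrt(2) matches Rayleigh(sigma).
    print(stats.weibull_min.pdf(x, c=2, scale=sigma * np.sqrt(2)))
    print(stats.rayleigh.pdf(x, scale=sigma))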
Example 6: The Pearson type V distribution or Inverse gamma distribution
Mostly used to model the time taken to perform a task; also known as the inverse gamma distribution.
  • Notes and formulas: Check out the Wikipedia entry
  • Quick summary: This distribution can be modeled using the gamma distribution. If X ~ PearsonV(scale = a, shape = b), then Y = 1/X ~ gamma(location = 0, scale = 1/a, shape = b); it follows that X = 1/Y.
  • Generate random numbers: Given the parameters of the Pearson V distribution (scale = a, shape = b), set up a gamma distribution assumption with location = 0, scale = 1/a and shape = b. Next, set up a forecast cell containing the inverse of the gamma assumption cell. As you run the simulation, the forecast values represent random numbers from the Pearson V distribution.
Figure 1: Excel worksheet for simulating Pearson-V distribution in CB
Figure 2: CB chart showing the Pearson-V distribution with statistics
In the figures above, we show the details of simulating a PearsonV(scale=3, shape=5) distribution. Notice how close the mean and variance values in the spreadsheet (Figure 1) are to those in the statistics window of the forecast (Figure 2).
  • Fit to this distribution: Fitting to a Pearson-V distribution is also easy. Given a dataset, follow the steps below:
    • Calculate the inverse of the values
    • Fit these values to a gamma distribution with location locked to 0.
    • The scale of the Pearson V is the inverse of the fitted gamma scale; the shape of both distributions is the same.
Figure 3: Fitting data to a Pearson V distribution
In the figure above (Figure 3), we started with a set of 1000 data points from a Pearson V(scale=3, shape=5) distribution, generated using the technique mentioned above. The best-fit gamma distribution has the following parameters: scale = 0.31, shape = 5.42 (to 2 decimal places). Transforming back to the Pearson V distribution, we get a scale of 3.23 (=1/0.31) and a shape of 5.42. The fit is reasonably close for a dataset of 1000 points. The same generate-and-fit workflow is sketched numerically below.
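For readers working outside of CB, here is a minimal sketch of that generate-and-fit workflow in Python/scipy (an illustration of the inverse-gamma relationship only, not CB's fitting engine):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    a, b = 3.0, 5.0   # Pearson V scale and shape

    # Generate: draw gamma(shape=b, scale=1/a) variates and invert them.
    y = rng.gamma(shape=b, scale=1.0 / a, size=1000)
    x = 1.0 / y       # Pearson V (inverse gamma) variates

    # Fit: invert the data, fit a gamma with the location locked to 0,
    # then transform the fitted scale back.
    shape_fit, _, scale_fit = stats.gamma.fit(1.0 / x, floc=0)
    print("shape ~", shape_fit, " scale ~", 1.0 / scale_fit)   # close to b and a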

    Monday, October 4, 2010

    The extended family of CB Assumptions (Part 1)

    Over the years, we have received a lot of requests for adding new distributions to our distribution gallery. We have added new distributions in the recent past (like betaPERT), but not at the rate at which we receive the requests. One of the primary reasons for not implementing some of these requests is that many of the requested distributions are special forms of distributions we already support (no, that is not the only reason, but it is often the reason). Here we will discuss some of the distributions which fall under this category, and are therefore highly unlikely to be supported natively.

    Before starting, though, let's make sure we understand what is meant by supporting a distribution in Crystal Ball. If we support a distribution, we have to be able to do the following two things:
    • Generate random numbers from this distribution
    • Fit data to this distribution and generate Goodness-of-fit (GOF) statistics
    The above two tasks are, not surprisingly, listed in ascending order of difficulty. If you want to use a new distribution in your simulation model, you might want to do one of the above two tasks, and in some cases both. So, let's get started with two easy examples in this post.

    Example 1: The Erlang distribution
    This one is really straightforward. The Erlang distribution is a special case of the gamma distribution where the shape parameter is an integer. Since we accept both integer and non-integer values for the shape parameter of our gamma distribution, simulating the Erlang distribution with the gamma distribution is easy.
    • Notes and formulas: Wikipedia entry linked above is sufficient.
    • Generate random numbers: Define a gamma distribution assumption as usual; just use an integer shape value. (A quick check of the relationship appears after this list.)
    • Fit to this distribution: Fitting to an Erlang distribution is slightly tricky. We will come back to this in a future post in this series.
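    The Erlang-gamma relationship is easy to verify numerically outside of CB; a minimal Python sketch, for illustration only:

        import numpy as np

        rng = np.random.default_rng(7)
        k, rate = 4, 2.0     # Erlang shape (integer) and rate
        n = 100_000

        # Erlang(k, rate) is the sum of k independent exponential(rate) variates,
        # which is the same as a gamma variate with integer shape k and scale 1/rate.
        erlang_as_sum = rng.exponential(scale=1.0 / rate, size=(n, k)).sum(axis=1)
        erlang_as_gamma = rng.gamma(shape=k, scale=1.0 / rate, size=n)

        print(erlang_as_sum.mean(), erlang_as_gamma.mean())   # both ~ k/rate = 2.0
        print(erlang_as_sum.var(), erlang_as_gamma.var())     # both ~ k/rate^2 = 1.0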

    Example 2: The two parameter lognormal distribution
    Some time back, we introduced the 3-parameter lognormal distribution, the extra parameter being the location. The 3-parameter lognormal distribution is somewhat unusual, since it is rarely covered in textbooks and the use of the distribution in this form is not well documented. Our main reason for introducing it was to offer more flexibility in the distribution (like allowing values < 0). Although we made sure at the time that this new distribution is completely backward-compatible with the classic two-parameter distribution (all models having the two-parameter distribution run the same way without any change), I am sure this change might have taken a few souls by surprise (yes, that might be putting it mildly in some cases). So, let's go back in time and simulate the classic 2-parameter lognormal in the current version of Crystal Ball.
    • Notes and formulas: The Wikipedia entry will serve us well here.
    • Generate random numbers: That's easy - define a 3-parameter lognormal distribution with the location set to 0.
    • Fit to this distribution: This is also easy with a few new features which were introduced in the same version of Crystal Ball. When you get to the 'Fit Distribution' dialog from the 'Distribution Gallery', switch on 'Lock parameters', available at the bottom left of the screen. That brings up the 'Lock Parameters' dialog. Select to lock the location of the lognormal distribution to 0, as shown in the screenshot below. This results in fitting to the classic lognormal distribution, and you get all the GOF statistics for the fit.
    Fitting data to a 2-parameter lognormal distribution
    The same technique can be used to model the 2-parameter versions of the gamma and Weibull distributions. Textbook treatments of these distributions typically include only the scale and shape parameters; the CB implementation adds a location parameter for extra flexibility, but one can model the two-parameter versions of these distributions easily as discussed above. The short sketch below shows the same locked-location idea outside of CB.
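    To make the location-parameter idea concrete outside of CB, here is a minimal sketch in Python/scipy (its parameterization differs from CB's, and it is used here purely for illustration): the classic two-parameter lognormal is just the three-parameter form with the location fixed at 0.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)
        data = rng.lognormal(mean=1.0, sigma=0.4, size=2000)   # classic 2-parameter data

        # Free fit of the 3-parameter form (shape, location, scale)...
        print(stats.lognorm.fit(data))
        # ...versus a fit with the location locked to 0, i.e. the classic 2-parameter form.
        print(stats.lognorm.fit(data, floc=0))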

    Part 2:  The extended family of CB Assumptions

    Saturday, September 11, 2010

    Validation with Excel: Percentile rank calculations in CB

    Oftentimes we would like to validate the results we are getting from CB against a hand-calculated version of the model in Excel. The standard way of doing this is to check whether the statistics of a forecast match those we calculate in Excel. It is easy: we can extract all the trial values for a forecast using the 'Extract Data' feature, and then use the corresponding Excel functions to calculate various statistics on those values. Quite often, we find that the values reported by CB do not match those calculated using Excel functions. In this post, I will try to explain the difference for one such function as implemented in Crystal Ball and in Excel. Of course, we do not have any visibility into how Excel has implemented these functions, so readers are welcome to provide corrections or references on these comments. Another note: I am using Excel 2003 as the Excel reference implementation in these posts. If you are aware of any difference in Excel's implementation in subsequent versions, please leave a comment.

    In this post, we will detail the function for calculating the percentile rank of a given number within an ordered dataset. The function to use in Excel for this purpose is PERCENTRANK(array, x, significance). Quoting Excel help:
    Returns the rank of a value in a data set as a percentage of the data set. This function can be used to evaluate the relative standing of a value within a data set.
    One important aspect of the calculation mentioned in the Excel help page is the following:
    If x does not match one of the values in array, PERCENTRANK interpolates to return the correct percentage rank.
    It is in this interpolation method that CB differs from Excel. First off, just as there is no standard definition of a percentile, there is no standard definition of percentile rank either. All the formulas available in the literature yield similar, but not exactly the same, results. In CB, to calculate the interpolation, we use the 1-based indices of the values just below and just above x in the sorted version of the original array. We do not know how Excel carries out the calculation, but in our tests the final Excel result can differ substantially from the CB result. The sketch below illustrates one such rank-interpolation scheme.
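    As an illustration of how such definitions can diverge, here is a minimal sketch of one plausible rank-interpolation scheme based on the 1-based indices of the bracketing values. This is an assumption made purely for illustration; it is not guaranteed to match either CB's or Excel's exact formula.

        import numpy as np

        def percentile_rank(data, x):
            """Percentile rank of x by linear interpolation between the
            1-based ranks of the bracketing values (illustrative scheme only)."""
            s = np.sort(np.asarray(data, dtype=float))
            n = len(s)
            i = np.searchsorted(s, x, side='right')   # number of values <= x
            if i > 0 and s[i - 1] == x:               # exact match
                return (i - 1) / (n - 1)
            if i == 0 or i == n:                      # outside the observed range
                raise ValueError("x is outside the data range")
            lo, hi = s[i - 1], s[i]                   # bracketing values, ranks i and i+1
            frac = (x - lo) / (hi - lo)
            return ((i - 1) + frac) / (n - 1)

        data = [1, 2, 4, 8, 16]
        print(percentile_rank(data, 4))    # 0.5, an exact match
        print(percentile_rank(data, 6))    # interpolated between the ranks of 4 and 8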

    Saturday, July 24, 2010

    A different run (Part 4): Multi-threading and seed

    This is the fourth and final part of a series of posts detailing the reasons for getting different simulation output when rerunning Monte Carlo simulation models in CB. The other parts can be found here:
    Part 1: CB Excel functions are not tied to seed
    Part 2: Excel RAND() function in CB models - handle with care
    Part 3: Running OptQuest with low-confidence testing at Extreme Speed

    In this last post, I will cover two other reasons for getting different results from simulation runs. The first one is non-obvious: we have seen that running optimization in a multi-threaded Excel environment, like Excel 2007, can result in differences in the numbers from simulation to simulation. We use the Excel engine for calculating the values of the forecasts in the spreadsheet model, and certain dependencies in the model, when coupled with the multi-threaded Excel calculation engine, result in this issue. A good overview of the multi-threading option for Excel recalculations can be found here.

    Workaround: If you notice this issue, check if you have multi-threaded calculations enabled. In Excel 2007, you can find this option at Office Button -> Excel Options -> Advanced -> Formulas -> 'Enable multi-threaded calculation'.

    The last one is the obvious one. If you have come this far while debugging why you are getting different results from your spreadsheet each time you run your simulation or optimization model, do not forget to check whether you have set the option of using the same sequence of random numbers in the CB 'Run Preferences' dialog. You will find it in the 'Sampling' tab of the 'Run Preferences' dialog. Note that you might have used VBA code to alter this option and then forgotten to reset it.

    Still not able to find the reason? Contact us for support.

    Friday, July 16, 2010

    A different run (Part 3): Running OptQuest with low-confidence testing at Extreme Speed

    This is the third part of a series of posts detailing the reasons for getting different simulation output when rerunning Monte Carlo simulation models in CB. The other parts can be found here:
    Part 1: CB Excel functions are not tied to seed
    Part 2: Excel RAND() function in CB models - handle with care
    Part 4: Multi-threading and seed

    This post deals with a rather obscure setting which can lead to this situation. In the Options tab in OptQuest, you will find an option called "Enable low-confidence testing" in the "Advanced Options..." dialog box (as shown below).

    As mentioned in the dialog box, this setting improves the optimization time by stopping simulations early if the solution appears to be inferior to the best solution found so far. But this option does not work as well when the simulation is run at Extreme speed. The reason is that, at Extreme speed, the random numbers are generated in bursts, and the size of these bursts can vary from simulation to simulation. Naturally, the statistics that get calculated from the trials can also vary based on these burst sizes. As an example, let's say that the first time you run an optimization model with this setting enabled, the burst size is 500, and the engine decides to stop early because the solution seems inferior. In the next optimization run, the burst size may be 400, and the trials generated when the simulation is stopped might be different.

    Workaround: Do not use this option if you want to reproduce optimization results.

    Wednesday, June 30, 2010

    A different run (Part 2): Excel RAND() function in CB models - handle with care

    This is the second part of a series of posts detailing the reasons for getting different simulation output when rerunning Monte Carlo simulation models in CB. The other parts can be found here:
    Part 1: CB Excel functions are not tied to seed
    Part 3: Running OptQuest with low-confidence testing at Extreme Speed
    Part 4: Multi-threading and seed

    I hinted at this in the first post: using the Excel RAND() function in a CB model results in a different outcome for each simulation run. The RAND() function generates a uniformly distributed random number between 0 and 1 in the cell. This is a volatile function: each instance of the function in the spreadsheet model returns a new value each time the worksheet is recalculated. There is no control over how these random numbers are generated: you cannot specify a seed for this function. So essentially, every time you run the simulation model and the sheet gets recalculated for each trial, you get different values from the RAND() function, and if you have forecasts linked to these values, they will be different in each run. They may look similar, and might fit to the same probability distribution, but even that is not guaranteed.

    Workaround: Use a CB uniform distribution assumption on [0,1] to generate similar random numbers; now you can control the stream of numbers by setting a seed. But do not use the CB.Uniform(..) Excel function - as mentioned in the first post, that would lead to the same problem. Use other CB assumptions to generate other random numbers to suit your needs.

    One common use of the RAND() function is to generate integers between two integer limits, and use these integers as the index in a lookup function to return a value from a range - essentially picking a value from a range with replacement. For example, you might want to pick a person from a list of 20 people, and it is fine to pick the same person more than once. In this case, assuming that each person is equally likely to be picked on any draw, you would want a discrete uniform distribution. Rather than using a formula like "=INT(RAND() * 20)+1" in Excel to generate a discrete uniform random number between 1 and 20, use the corresponding assumption from CB. Of course, it is also easy to tweak this scheme for picking values from a range without replacement. The sketch below shows why the seeded approach matters for reproducibility.
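    To illustrate the reproducibility point outside of Excel and CB, here is a minimal Python sketch contrasting unseeded and seeded draws of such a discrete uniform index (illustration only; in CB the stream is controlled from the 'Run Preferences' dialog, and the list of people is made up):

        import numpy as np

        people = [f"person_{i}" for i in range(1, 21)]   # hypothetical list of 20 people

        # Unseeded: like Excel's RAND(), a new stream every run - not reproducible.
        unseeded = np.random.default_rng()
        print([people[i - 1] for i in unseeded.integers(1, 21, size=5)])

        # Seeded: like a CB assumption with the same-sequence option set -
        # the same picks are reproduced on every run.
        seeded = np.random.default_rng(seed=999)
        print([people[i - 1] for i in seeded.integers(1, 21, size=5)])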

    Thursday, June 24, 2010

    ISF 2010 Update: Nice experience

    I am into the last day of the ISF 2010 conference. Overall, it was a nice conference. In the past, I have attended conferences like the INFORMS conferences (the annual conference, the practice conference) and the IIE conference, all of which have forecasting tracks. This is the first time I have attended a conference exclusively covering forecasting, so I was not sure what to expect. But I was not disappointed - this conference had plenty of nice talks to keep me interested and busy.

    Some of the talks I attended focused on ARCH and GARCH models, ARIMA models, ways to improve forecasting by aggregating different forecast streams, neural network (NN) based forecasting, and ways to aggregate forecasts across different levels. There were also talks featuring classic methods like exponential smoothing and moving averages. I learned a few new things, and got a few ideas that I need to try out with CB Predictor.

    I also presented on the forecasting capabilities of CB Predictor. The other speakers in the session were the forecasting heavyweights AutoBox and SAS. Of course, we do not compare anywhere in the feature list with what that software offers, but CB Predictor does offer a user-friendly, non-intimidating environment, the familiar Excel spreadsheet goodness, and, most importantly, the capability of mixing in risk analysis using Monte Carlo simulation. I think that is a strong value proposition, considering that a lot of small and medium-sized businesses do not have a team of dedicated forecasters who work only on forecasting. These companies typically have a few analytics folks who take care of various analytic duties, including risk analysis and forecasting. CB Predictor suits this kind of scenario well.

    My ISF2010 presentation can be found here.

    Update (8/26/2011): Language.
    Update (12/24/2011): Labels.

    Saturday, June 19, 2010

    A different run (Part 1): CB Excel functions are not tied to seed

    We often get asked about CB Monte Carlo simulation models or OptQuest optimization models which produce a different result each time a complete simulation or optimization is run. There are a few situations which can result in this. In this series of posts, I will try to explain each reason and a possible workaround.

    The first one is actually an interesting one. Most of our power users are aware that CB publishes a few functions in Excel, which can be used as functions in Excel spreadsheet models. These functions show up when you select 'Crystal Ball' in the 'Insert Function' dialog in Excel. There are a number of distribution-specific functions there, like CB.Beta(...), CB.Normal(...) etc., which return random numbers from these distributions. The caveat to be aware of when using these functions is that they are not tied to the seed value which you can set in CB 'Run Preferences'. That is, irrespective of whether you select to use a specific seed value in that preference, these functions always use a new seed value for each evaluation of the function. So, each time you run a simulation, reset it and rerun it, you get a new set of values, and hence the result will be different.

    This type of function is termed 'volatile' in Excel, an example of which is the RAND() function. In fact, heads up: using the RAND() function in your model causes the same problem - you get a different result every time you run the simulation (I will do a separate short post on that later). These functions return new values each time the worksheet is recalculated.

    Workaround: Use CB assumptions. Either define them using the GUI or using the Developer Kit. You will get the usual color-coded cells, and these assumptions will respect the preferences you set in the 'Run Preferences' dialog.

    Links to other posts in this series:
    Part 2: Excel RAND() function in CB models - handle with care
    Part 3: Running OptQuest with low-confidence testing at Extreme Speed
    Part 4: Multi-threading and seed

    Wednesday, June 16, 2010

    Session on CB Predictor in ISF 2010


    I will be presenting at the International Symposium on Forecasting, 2010, being held in San Diego, CA. The presentation is titled "Introducing Oracle Crystal Ball Predictor: a new approach to forecasting in Excel". The abstract of the talk is as follows:

    In this presentation, we will give an overview of the functionality provided by Oracle Crystal Ball's (CB) Predictor engine. In brief, CB Predictor has an intuitive interface for selecting, managing and cleaning data; it runs multiple forecasting algorithms (including seasonal and nonseasonal models) and regression on large datasets; and it presents and extracts results and generates reports in a coherent way. Some of the innovative features include outlier detection and filling in missing values. We will also have a sneak preview of forthcoming features.

    The talk is scheduled for Monday, June 21st, in the 2:00PM to 3:30PM session. I will post the slides after the presentation. I am also planning to blog about interesting presentations while at the conference.

    Update: Here is a link to the presentation.
    Update (12/24/2011): Labels.

    Thursday, April 22, 2010

    Examples of Monte Carlo simulation

    I often get asked about materials on Monte Carlo simulation - specifically, examples of MC simulation (software neutral). Examples of MC simulation are somewhat hard to come by. The books which deal with MC simulation mostly cover its theoretical aspects and offer a few trivial examples, like calculating the area of a circle using random numbers, or numerical integration (although, I agree, numerical integration using MC simulation is not a trivial example in general).
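    For instance, the circle-area example mentioned above takes only a few lines in any environment; a minimal Python sketch:

        import numpy as np

        rng = np.random.default_rng(2024)
        n = 1_000_000

        # Sample points uniformly in the unit square and count how many fall
        # inside the quarter circle of radius 1; the fraction estimates pi/4.
        x, y = rng.random(n), rng.random(n)
        inside = (x**2 + y**2) <= 1.0
        print(4 * inside.mean())   # close to pi ~ 3.14159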

    For examples of where and how one can use Monte Carlo simulation, one great resource is the example library that ships with Oracle Crystal Ball. Here is how to access these examples: if you don't have access to Oracle Crystal Ball, you can download a seven-day trial at [http://www.oracle.com/technetwork/middleware/crystalball/downloads/index.html]. Note that you might have to register or log in.

    Once you have access to the software, start Oracle Crystal Ball with Microsoft Excel. If you navigate to the Help -> Crystal Ball -> Examples Guide, you will find a list of about 50 or so real-world examples spanning subject areas like finance, manufacturing, supply chain, oil and gas, six sigma etc., and even a groundwater cleanup model. The examples cover all aspects of the Oracle Crystal Ball software, so one can learn how MC simulation is used with optimization, forecasting and so on. Each example discusses the problem, how it is mathematically modeled, and how to perform MC simulation on it using Oracle Crystal Ball. Even if you run out of trial time, you can still refer to the example problems and the way they are modeled. And, of course, you do not have to use Oracle Crystal Ball to try out the examples - just use your favorite software or programming language.

    Update (3/16/2011): Edited download link.