Saturday, September 11, 2010

Validation with Excel: Percentile rank calculations in CB

Often times we would like to validate the results that we are getting from CB with a hand-calculated version of the model in Excel. The standard way of doing this is to check if the statistics of a forecast match with those we calculate through Excel. It is easy: we can extract all the trial values for a forecast using the 'Extract data' feature, and then use the corresponding Excel function to calculate various statistics on those values. Quite often, we find that the values reported by CB are not matching with those calculated using Excel functions. In this post, I will try to explain the difference for one such function as implemented in Crystal Ball and Excel. Of course, we do not have any visibility in how Excel has implemented these functions. Readers are welcome to provide corrections or references on these comments. Another note: I am using Excel 2003 as the Excel reference implementation in these posts. If you are aware of any difference in Excel implementation in subsequent versions, please leave a comment.

In this post, we will detail about the function for calculating the  percentile rank of a given number in a ordered dataset. The function to use in Excel for this purpose is PERCENTILERANK(array, x, significance). Quoting Excel help:
Returns the rank of a value in a data set as a percentage of the data set. This function can be used to evaluate the relative standing of a value within a data set.
One important aspect of the calculation mentioned in the Excel help page is the following:
If x does not match one of the values in array, PERCENTRANK interpolates to return the correct percentage rank.
It is this interpolation method where CB differs from Excel. First off, just like there is no standard definition of percentile, there is no standard definition of percentile rank either. All the formulas available in the literature yield similar results, but not the exact same. In CB, to calculate the interpolation, we use the 1-based index of the values just below and above x in the sorted version of the original array. We do not know how Excel carries out the calculation, but in our tests, final Excel result can differ substantially from CB result.