Thursday, April 7, 2011

Triangular distribution: the mean has a range

I was trying to decide on a distribution for a few small datasets (18 points at most). Specifically, I thought a triangular distribution would be a good choice, since we had only a limited amount of data and the datasets were always skewed, mostly with the bulk of the values towards the left (the low end). A triangular distribution seemed general enough to capture the full range of risk in the data. But rather than fitting the data directly to a triangular distribution, I thought I would use a parametrized fit, in which I calculate the triangular parameters from the minimum, maximum and average of the data. The rationale was that, since I did not have enough data points, a direct fit would not be great anyway, and since I was set on using a triangular distribution, a fit based on these summary statistics would work better.

Given the minimum, maximum and average of a dataset, finding the triangular parameters minimum ($a$), most likely ($b$) and maximum ($c$) is easy. The average (mean) of the distribution is given by:
$\mu = \frac{a + b +c}{3}$
So, given $a, c$ and $\mu$, we can calculate b:
$b=3\mu - a - c$
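
As a minimal sketch, the rearranged formula can be written as a small Python function. The function name `triangular_mode` and the sample inputs are my own, chosen only to illustrate the arithmetic; the function does nothing more than solve the mean equation for $b$ and performs no check on the result.

```python
def triangular_mode(minimum, maximum, mean):
    """Solve mean = (a + b + c) / 3 for the most likely value b."""
    return 3 * mean - minimum - maximum


# Hypothetical inputs (not from any real dataset), used only to show the arithmetic.
b = triangular_mode(minimum=0, maximum=12, mean=5)
print(b)  # 3
```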

While applying the formula to the statistics from a few datasets, it soon became apparent that the triangular distribution does not work for some of them. For example,
$\min=0,\ \max=105,\ \textrm{average}=37.4 \Rightarrow \textrm{most likely}=7.2$
$\min=5,\ \max=175,\ \textrm{average}=54.3 \Rightarrow \textrm{most likely}=-17.1$
Now, the most likely value from the second example cannot be used, since it is lower than the minimum. In fact, while checking the math, I realized that a triangular distribution cannot support arbitrary average values: just because the average lies between the minimum and the maximum does not mean that a triangular distribution with that minimum and maximum can produce it. The most likely value must satisfy $a \le b \le c$, so substituting $b=a$ and $b=c$ into $\mu = \frac{a + b + c}{3}$ gives the smallest and largest possible averages. The range of averages that a triangular distribution with minimum $a$ and maximum $c$ can support is therefore:
Range = $[\frac{2a+c}{3}, \frac{a+2c}{3}]$. 
If the average value from the data is not within this range, it cannot be used for defining a triangular distribution.
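
The range check is easy to run before computing the most likely value. The following is a rough sketch rather than a definitive implementation; the function names `supported_mean_range` and `mode_if_supported` are assumptions of mine, and returning `None` for an unsupported average is just one possible design choice.

```python
def supported_mean_range(a, c):
    """Return the (lowest, highest) mean that a triangular
    distribution with minimum a and maximum c can have."""
    return (2 * a + c) / 3, (a + 2 * c) / 3


def mode_if_supported(a, c, mean):
    """Return the most likely value b = 3*mean - a - c,
    or None if the mean falls outside the supported range."""
    low, high = supported_mean_range(a, c)
    if not (low <= mean <= high):
        return None
    return 3 * mean - a - c


# The two examples from this post.
print(supported_mean_range(0, 105))     # (35.0, 70.0)
print(mode_if_supported(0, 105, 37.4))  # 37.4 is inside the range, so a mode is returned
print(supported_mean_range(5, 175))     # lower bound is about 61.67
print(mode_if_supported(5, 175, 54.3))  # None
```

For the second dataset the average of 54.3 is below the lower bound of about 61.67, which is why the formula produced a most likely value below the minimum.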

That made me realize that my initial assumption, that a triangular distribution suits these datasets, was wrong in itself. Further research suggested that the best distributions to use for these datasets are beta, Weibull or lognormal. Triangular ended up towards the end of the list, sometimes even below the normal distribution.

Note: I have started using MathJax to render mathematical formulas in blog posts. If you see strange symbols in a post, either you are using a dedicated reader that has trouble rendering these formulas, or the browser page needs to be refreshed. If you are using a dedicated reader, please visit the post's web page to view it in a browser.
