Statistical Dispersion

Statistical dispersion refers to how widely the values of a particular variable are spread. There are many measures of dispersion, including the variance, the standard deviation, and the interquartile range.

The concept of dispersion includes that of scatter, or variability. It could be measured by the difference between each datum value and every other value, but in any data set of reasonable size this would require very many differences to be calculated. It is more straightforward to calculate the difference of each value from a single central value, namely the mean. Suppose the sample data in hand are x1, x2, ..., xn and that their mean is x̄. The deviation from the mean is then (xi – x̄), where i = 1, 2, ..., n. It might be thought that the next step would be to average these deviations, but because some are negative and others positive, this average is always zero. Instead, square the deviations to remove the negative signs and take the average of the results. The resulting quantity is the variance, a widely used measure of dispersion. The positive square root of the variance is the standard deviation.
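A minimal sketch of this point, with made-up values: the raw deviations from the mean cancel to zero, which is why they are squared before averaging.

```python
data = [2.0, 3.0, 7.0, 9.0]
mean = sum(data) / len(data)           # x̄ = 5.25
deviations = [x - mean for x in data]  # (xi − x̄): -3.25, -2.25, 1.75, 3.75
total = sum(deviations)                # positives and negatives cancel to 0
squared = [d ** 2 for d in deviations]
variance = sum(squared) / (len(data) - 1)  # sample variance, n − 1 denominator
```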

The square root is taken because the variance is expressed in the squared units of the variable of interest; the square root returns the measure to the original units. For sample data that are not grouped in any way, the formulae are:

variance: s² = Σ(xi – x̄)² / (n – 1)

standard deviation: s = √[ Σ(xi – x̄)² / (n – 1) ]
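The sample formulae (with the n – 1 denominator) can be sketched directly in Python and checked against the standard library's statistics module, which uses the same definitions.

```python
import statistics

def sample_variance(xs):
    """Sample variance: sum of squared deviations over (n - 1)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    """Sample standard deviation: positive square root of the variance."""
    return sample_variance(xs) ** 0.5

data = [2.0, 3.0, 7.0, 9.0]
var = sample_variance(data)  # agrees with statistics.variance(data)
sd = sample_sd(data)         # agrees with statistics.stdev(data)
```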

Note that when dealing with samples, using (n – 1) rather than n in the denominator gives a better estimate of the population standard deviation in the long run. (The population standard deviation itself is calculated with n in the denominator.) The standard deviation is unchanged by a change of origin: adding or subtracting the same constant to every datum value leaves it the same. Thus the values 2, 3, 7, 9 have the same standard deviation as 3, 4, 8, 10, obtained by adding the constant 1. If, however, each datum value is divided by 3, the standard deviation is likewise divided by 3.
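Both properties are easy to verify numerically, using the same example values:

```python
import statistics

data = [2.0, 3.0, 7.0, 9.0]
shifted = [x + 1 for x in data]  # 3, 4, 8, 10 — same spread, new origin
scaled = [x / 3 for x in data]   # spread shrinks by the same factor of 3

sd = statistics.stdev(data)
assert abs(statistics.stdev(shifted) - sd) < 1e-9      # unchanged by shifting
assert abs(statistics.stdev(scaled) - sd / 3) < 1e-9   # divided by 3
```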

Means add: mean(x + y) = mean(x) + mean(y). Variances and standard deviations behave differently: the SD of a sum or difference of values is not the sum or difference of the SDs. In the special case where x and y are uncorrelated, variance(x + y) = variance(x) + variance(y). In general, however, variance(x + y) = variance(x) + variance(y) + 2 × covariance(x, y), where covariance(x, y) = Σ(xi – x̄)(yi – ȳ) / (n – 1). This sum of products of deviations applies when each value of x is paired with exactly one value of y. When x and y tend to move in opposite directions, the covariance is negative.

For skewed data, the standard deviation is unlikely to be the best measure of dispersion. In such cases, use the interquartile range, which covers the middle 50% of the values.
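A short sketch with a made-up right-skewed data set: the interquartile range is the distance between the first and third quartiles, here computed with the standard library's quantiles function (Python 3.8+).

```python
import statistics

# Right-skewed illustrative data: a long tail of large values.
data = [1, 2, 2, 3, 3, 4, 5, 6, 9, 15, 40]

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle 50% of the values
```

Unlike the standard deviation, the IQR is unaffected by the extreme value 40 in the tail.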
