## Why square the difference instead of taking the absolute value in standard deviation?

We square the differences of the x's from the mean because the Euclidean distance, scaled by the square root of the degrees of freedom (the number of x's, for a population measure), is the best measure of dispersion.

That is, when the x's have zero mean ($\mu = 0$):

$$
\sigma = \sqrt{\frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}} = \frac{\sqrt{\displaystyle\sum_{i=1}^{n}(x_i)^2}} {\sqrt{n}} = \frac{\text{distance}}{\sqrt{n}}
$$

The square root of the sum of squares is the distance from the mean to the point in $n$-dimensional space whose coordinates are the individual data points.

### Calculating distance

What's the distance from point 0 to point 5?

- $5-0 = 5$,
- $|0-5| = 5$, and
- $\sqrt{5^2} = 5$

Ok, that's trivial because it's a single dimension.

How about the distance from point (0, 0) to point (3, 4)?

If we can only go in one dimension at a time (as on city blocks), then we just add the numbers up: $3 + 4 = 7$. This is sometimes known as the Manhattan distance.

But what about going in two dimensions at once? Then (by the Pythagorean theorem we all learned in high school), we square the distance in each dimension, sum the squares, and then take the square root to find the distance from the origin to the point.

$$
\sqrt{3^2 + 4^2} = \sqrt{25} = 5
$$
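The two ways of measuring the trip from (0, 0) to (3, 4) can be sketched in a few lines of Python. This is a minimal illustration; the function names `manhattan_distance` and `euclidean_distance` are made up for this example and measure from the origin only.

```python
import math

def manhattan_distance(point):
    """Sum of absolute coordinates: travel along one axis at a time."""
    return sum(abs(coordinate) for coordinate in point)

def euclidean_distance(point):
    """Square each coordinate, sum the squares, take the square root."""
    return math.sqrt(sum(coordinate ** 2 for coordinate in point))

point = (3, 4)
print(manhattan_distance(point))  # 7
print(euclidean_distance(point))  # 5.0
```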

Visually (see the markdown source of the answer for the code to generate):

### Calculating distance in higher dimensions

Now let's consider the 3-dimensional case. For example, what is the distance from point (0, 0, 0) to point (2, 2, 1)?

This is just

$$
\sqrt{\left(\sqrt{2^2 + 2^2}\right)^2 + 1^2} =
\sqrt{2^2 + 2^2 + 1^2} = \sqrt{9} = 3
$$

because the distance for the first two x's forms one leg of a right triangle, and the final x forms the other leg, so the total distance is the hypotenuse:

$$
\sqrt{\left(\sqrt{x_1^2 + x_2^2}\right)^2 + x_3^2} = \sqrt{x_1^2 + x_2^2 + x_3^2}
$$

Demonstrated visually:

We can continue to extend the rule of squaring each dimension's distance. This generalizes to what we call the Euclidean distance, for orthogonal measurements in hyperdimensional space, like so:

$$
\text{distance} = \sqrt{ \sum\nolimits_{i=1}^n{x_i^2} }
$$

and so the sum of orthogonal squares is the squared distance:

$$
\text{distance}^2 = \sum_{i=1}^n{x_i^2}
$$
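As a rough check that folding in one dimension at a time gives the same answer as the general formula, here is a small sketch. The helper names are hypothetical, and the 5-dimensional point is made up for illustration.

```python
import math

def euclidean_distance(coordinates):
    """General formula: square root of the sum of squared coordinates."""
    return math.sqrt(sum(x ** 2 for x in coordinates))

def distance_one_leg_at_a_time(coordinates):
    """Use the running distance as one leg and the next coordinate as the other."""
    running = 0.0
    for x in coordinates:
        running = math.hypot(running, x)  # hypotenuse of the two legs
    return running

point_3d = (2, 2, 1)
print(euclidean_distance(point_3d))  # 3.0
print(math.isclose(distance_one_leg_at_a_time(point_3d), 3.0))  # True

# The same agreement holds in more dimensions (made-up 5-dimensional point).
point_5d = (1, 2, 3, 4, 5)
print(math.isclose(distance_one_leg_at_a_time(point_5d),
                   euclidean_distance(point_5d)))  # True
```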

What makes a measurement orthogonal (or at right angles) to another? The condition is that there is no relationship between the two measurements. We would look for these measurements to be *independent and identically distributed* (*i.i.d.*).

### Variance

Now recall the formula for population variance (from which we'll get the standard deviation):

$$
\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}
$$

If we've already centered the data at 0 by subtracting the mean, we have:

$$
\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i)^2} {n}
$$

So we see the variance is just the *squared distance*, or $\text{distance}^2$ (see above), divided by the number of degrees of freedom (the number of dimensions on which the variables are free to vary). This is also the average contribution to $\text{distance}^2$ per measurement. "Mean squared deviation" would also be an appropriate term.

### Standard Deviation

Then we have the standard deviation, which is just the square root of the variance:

$$
\sigma = \sqrt{\frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}}
$$

Equivalently, for centered data, this is the *distance* divided by the square root of the degrees of freedom:

$$
\sigma = \frac{\sqrt{\displaystyle\sum_{i=1}^{n}(x_i)^2}} {\sqrt{n}}
$$
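To make the link between distance, variance, and standard deviation concrete, here is a minimal sketch on a small made-up data set (the values are hypothetical, chosen so the mean is 5). It compares the distance-based calculation with the standard library's `statistics.pvariance` and `statistics.pstdev`.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set, mean = 5
n = len(data)
mu = sum(data) / n
centered = [x - mu for x in data]  # subtract the mean so the data is centered at 0

squared_distance = sum(d ** 2 for d in centered)  # distance^2 from the mean
distance = math.sqrt(squared_distance)            # Euclidean distance from the mean

variance = squared_distance / n   # squared distance per degree of freedom
sigma = distance / math.sqrt(n)   # distance scaled by sqrt(n)

print(variance)  # 4.0
print(math.isclose(variance, statistics.pvariance(data)))  # True
print(math.isclose(sigma, statistics.pstdev(data)))        # True
```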

### Mean Absolute Deviation

Mean Absolute Deviation (MAD) is a measure of dispersion that uses the Manhattan distance, or the sum of absolute values of the differences from the mean.

$$
\text{MAD} = \frac{\displaystyle\sum_{i=1}^{n}|x_i - \mu|} {n}
$$

Again, assuming the data is centered (the mean subtracted) we have the Manhattan distance divided by the number of measurements:

$$
\text{MAD} = \frac{\displaystyle\sum_{i=1}^{n}|x_i|} {n}
$$
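The same calculation with absolute values instead of squares looks like this. It is a sketch on a small made-up data set (hypothetical values with mean 5), and the variable names are chosen for this example only.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set, mean = 5
n = len(data)
mu = sum(data) / n
centered = [x - mu for x in data]  # subtract the mean so the data is centered at 0

# Manhattan distance from the mean: sum of absolute deviations.
manhattan_distance_from_mean = sum(abs(d) for d in centered)

# Mean absolute deviation: Manhattan distance per measurement.
mad = manhattan_distance_from_mean / n

print(manhattan_distance_from_mean)  # 12.0
print(mad)                           # 1.5
```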

### Discussion

- The mean absolute deviation is about 0.8 times (exactly $\sqrt{2/\pi} \approx 0.798$ times) the size of the standard deviation for a normally distributed dataset.
- Regardless of the distribution, the mean absolute deviation is less than or equal to the standard deviation. MAD understates the dispersion of a data set with extreme values, relative to standard deviation.
- Mean Absolute Deviation is more robust to outliers (i.e. outliers do not have as great an effect on the statistic as they do on standard deviation).
- Geometrically speaking, if the measurements are not orthogonal to each other (not i.i.d.), for example if they are positively correlated, mean absolute deviation would be a better descriptive statistic than standard deviation, which relies on Euclidean distance (although using standard deviation in that case is usually considered fine).
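The $\sqrt{2/\pi}$ factor in the first point can be checked directly. A short derivation, assuming the measurements are centered and normally distributed, $X \sim N(0, \sigma^2)$:

$$
E|X| = 2\int_0^\infty x \, \frac{1}{\sigma\sqrt{2\pi}} \, e^{-x^2/(2\sigma^2)} \, dx = \frac{2}{\sigma\sqrt{2\pi}} \cdot \sigma^2 = \sigma\sqrt{\frac{2}{\pi}} \approx 0.798\,\sigma
$$

The middle step uses $\int_0^\infty x \, e^{-x^2/(2\sigma^2)} \, dx = \sigma^2$, and the factor of 2 comes from the symmetry of the density around 0.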

This table reflects the above information in a more concise way:

$$
\begin{array}{lll}
& \text{MAD} & \sigma \\ \hline
\text{size} & \le \sigma & \ge \text{MAD} \\
\text{size (normal data)} & \approx 0.8 \times \sigma & \approx 1.25 \times \text{MAD} \\
\text{outliers} & \text{robust} & \text{influenced} \\
\text{not i.i.d.} & \text{robust} & \text{ok}
\end{array}
$$
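The size and outlier rows can be illustrated with a small experiment. This is only a sketch: the data set and the outlier value 50 are made up, and `population_sd` and `mean_abs_dev` are hypothetical helper names. On this data, adding the outlier inflates the standard deviation by roughly a factor of 7 and the mean absolute deviation by roughly a factor of 6.

```python
import math

def population_sd(values):
    """Population standard deviation: Euclidean distance from the mean over sqrt(n)."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

def mean_abs_dev(values):
    """Mean absolute deviation: Manhattan distance from the mean over n."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)

baseline = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set
with_outlier = baseline + [50]       # same data plus one extreme value

for label, values in (("baseline", baseline), ("with outlier", with_outlier)):
    print(label, population_sd(values), mean_abs_dev(values))

# How much each statistic grows once the outlier is added.
sd_growth = population_sd(with_outlier) / population_sd(baseline)
mad_growth = mean_abs_dev(with_outlier) / mean_abs_dev(baseline)
print(sd_growth > mad_growth)  # True
```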

### Comments:

Do you have a reference for "mean absolute deviation is about .8 times the size of the standard deviation for a normally distributed dataset"? The simulations I'm running show this to be incorrect.

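Here are 10 simulations of one million samples each from the standard normal distribution. The first column is the standard deviation and the second is the mean absolute deviation, which stays close to $\sqrt{2/\pi} \approx 0.798$: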

```
>>> from numpy.random import standard_normal
>>> from numpy import mean, absolute, std
>>> for _ in range(10):
...     array = standard_normal(1_000_000)
...     # population standard deviation, then mean absolute deviation
...     print(std(array), mean(absolute(array - mean(array))))
...
0.9999303226807994 0.7980634269273035
1.001126461808081 0.7985832977798981
0.9994247275533893 0.7980171649802613
0.9994142105335478 0.7972367136320848
1.0001188211817726 0.798021564315937
1.000442654481297 0.7981845236910842
1.0001537518728232 0.7975554993742403
1.0002838369191982 0.798143108250063
0.9999060114455384 0.797895284109523
1.0004871065680165 0.798726062813422
```

### Conclusion

We prefer the squared differences when calculating a measure of dispersion because we can exploit the Euclidean distance, which gives us a better descriptive statistic of the dispersion. When there are more relatively extreme values, the Euclidean distance accounts for that in the statistic, whereas the Manhattan distance gives each measurement equal weight.