Suppose that:
- A normally distributed random variable $X$ is being sampled.
- There are $k$ partitions, each with $n_i$ samples (where $i = 1, \dots, k$).
- For each partition, the mean $\bar{x}_i$ and variance $s_i^2$ are known, but the original observations $x_{ij}$ are not available.
- The overall mean $\bar{x}$ and variance $s^2$ are desired.
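Computation of $n$ and $\bar{x}$ is straightforward:

$$n = \sum_{i=1}^{k} n_i \tag{1}$$

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{k} n_i \bar{x}_i \tag{2}$$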
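As a quick numerical check of the count and mean formulas, here is a minimal Python sketch. The function name `combine_counts_and_means` and the sample data are illustrative assumptions, not part of the original post; the raw observations are used only to produce the per-partition summaries and to verify the result.

```python
from statistics import fmean


def combine_counts_and_means(counts, means):
    """Return (n, overall_mean) from per-partition sample counts and means."""
    n = sum(counts)
    # Weight each partition mean by its sample count, then divide by the total.
    overall_mean = sum(n_i * m_i for n_i, m_i in zip(counts, means)) / n
    return n, overall_mean


# Hypothetical data: three partitions of unequal size.
partitions = [[2.0, 4.0, 6.0], [1.0, 3.0], [5.0, 7.0, 9.0, 11.0]]
counts = [len(p) for p in partitions]
means = [fmean(p) for p in partitions]

n, overall_mean = combine_counts_and_means(counts, means)

# Reference value computed from the raw observations.
direct_mean = fmean([x for p in partitions for x in p])
print(n, overall_mean, direct_mean)
```

The weighted mean should agree with the mean of the pooled observations up to floating-point rounding.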
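However, because the overall mean $\bar{x}$ is not available within the partition where $x_{ij}$ is sampled, the formula for the variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 \tag{3}$$

must be rewritten as

$$s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{k} (n_i - 1) s_i^2 + \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2 \right] \tag{4}$$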
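The rewritten variance formula adds the within-partition sums of squares, $(n_i - 1) s_i^2$, to the count-weighted squared deviations of the partition means from the overall mean, and divides by $n - 1$. The following Python sketch checks this against a direct computation on the pooled data. The function name `combine_variance` and the sample data are assumptions made for illustration, and each partition is assumed to hold at least two observations so that its sample variance is defined.

```python
from statistics import fmean, variance


def combine_variance(counts, means, variances):
    """Overall sample variance from per-partition counts, means and variances."""
    n = sum(counts)
    overall_mean = sum(n_i * m_i for n_i, m_i in zip(counts, means)) / n
    # Within-partition part: (n_i - 1) * s_i^2 recovers each partition's sum of squares.
    within = sum((n_i - 1) * v_i for n_i, v_i in zip(counts, variances))
    # Between-partition part: squared distance of each partition mean from the overall mean.
    between = sum(n_i * (m_i - overall_mean) ** 2 for n_i, m_i in zip(counts, means))
    return (within + between) / (n - 1)


# Hypothetical data: three partitions of unequal size.
partitions = [[2.0, 4.0, 6.0], [1.0, 3.0], [5.0, 7.0, 9.0, 11.0]]
counts = [len(p) for p in partitions]
means = [fmean(p) for p in partitions]
variances = [variance(p) for p in partitions]  # sample variance, n_i - 1 denominator

combined = combine_variance(counts, means, variances)
direct = variance([x for p in partitions for x in p])
print(combined, direct)
```

Only the counts, means and variances are passed to `combine_variance`; the raw observations appear solely to build those summaries and to compute the reference value.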
This formula may be derived as follows.
First introduce into the formula for
[5]
Then, simplify to remove from the partition summation
:
Apply
[6]
Replace with
[7]
Replace with
[8]
Distribute
[9]
Apply ;
express summation of sum as sum of summations
[10]
Because neither nor
depends on j
[11]
Apply ;
distribute ;
express summation of sum as sum of summations;
simplify
[12]
[13]
[14]
Apply ;
simplify
[15]
[16]
Express summation of sum as sum of summations;
simplify
[17]
Because does not depend on j, factor out
[18]
Apply and
;
simplify
[19]
[20]
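The key step in the derivation is that the cross term vanishes, because the deviations of the observations from their own partition mean sum to zero. The Python sketch below checks this numerically on made-up data: the total sum of squares about the overall mean should equal the within-partition part plus the between-partition part, with a cross term of zero up to rounding. All variable names and values here are illustrative assumptions.

```python
from statistics import fmean

# Hypothetical data: three partitions of unequal size.
partitions = [[2.0, 4.0, 6.0], [1.0, 3.0], [5.0, 7.0, 9.0, 11.0]]
all_values = [x for p in partitions for x in p]
overall_mean = fmean(all_values)

# Left-hand side: sum of squared deviations from the overall mean.
total_ss = sum((x - overall_mean) ** 2 for x in all_values)

within_ss = 0.0   # sum over partitions of squared deviations from the partition mean
between_ss = 0.0  # sum over partitions of n_i * (partition mean - overall mean)^2
cross = 0.0       # sum over partitions of 2 * (partition mean - overall mean) * sum_j (x_ij - partition mean)
for p in partitions:
    m = fmean(p)
    within_ss += sum((x - m) ** 2 for x in p)
    between_ss += len(p) * (m - overall_mean) ** 2
    cross += 2 * (m - overall_mean) * sum(x - m for x in p)

print(total_ss, within_ss + between_ss, cross)
```

Dividing `total_ss` by $n - 1$ gives the overall sample variance, so the agreement of the first two printed values is the numerical counterpart of the final formula.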
Reference
This post was largely inspired by
http://stats.stackexchange.com/questions/10441/how-to-calculate-the-variance-of-a-partition-of-variables/10445#10445
which tantalizingly ended with the (under-)statement:
"These formulas are easy to derive by writing the desired variance as the scaled sum of $(x_{ij} - \bar{x})^2$, then introducing $\bar{x}_i$: $[(x_{ij} - \bar{x}_i) - (\bar{x} - \bar{x}_i)]^2$, using the square of difference formula, and simplifying."