First, it is important to highlight a common misconception: the normality assumption refers to the mean of a metric of interest \(Y\) and not to the distribution of \(Y\) itself. In fact, in most cases the metric of interest \(Y\) will not follow a normal distribution; however, by the Central Limit Theorem, its average \(\bar{Y}\) will be approximately normal when the sample is large enough.
Why is normality important? Because we implicitly assume normality when we compute t-statistics and p-values. If, however, the mean of our variable of interest does not follow a normal distribution, then the estimated p-values are not accurate.
The main reason for violating the normality assumption is a smaller-than-needed sample size. In fact, as the sample size increases, the distribution of the mean \(\bar{Y}\) gets closer to a normal distribution.
A rule of thumb for the minimum sample size for \(\bar{Y}\) to be approximately normal is \(355s^2\), where \(s\) is the skewness coefficient of the sample distribution, defined as \(s = \frac{E[(Y - E[Y])^3]}{[Var(Y)]^{3/2}}\) (Kohavi, Tang, and Xu (2020), p. 187).
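To make the rule of thumb concrete, here is a minimal sketch that computes the skewness coefficient from a sample and plugs it into \(355s^2\). The function names and the revenue-like data are hypothetical and only for illustration; the skewness uses the plain moment formula above, with no small-sample correction.

```python
import math


def skewness(values):
    """Moment-based skewness: E[(Y - E[Y])^3] / Var(Y)^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5


def min_sample_size(values):
    """Rule-of-thumb minimum sample size, 355 * s^2, rounded up."""
    s = skewness(values)
    return math.ceil(355 * s ** 2)


# Hypothetical revenue-like metric: most users spend nothing, a few spend a lot.
revenue = [0.0] * 90 + [10.0] * 9 + [500.0]
print("skewness:", skewness(revenue))
print("minimum sample size:", min_sample_size(revenue))
```

A metric with a long right tail like this one ends up needing far more observations than a roughly symmetric metric, whose skewness is close to zero.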
To test whether the normality assumption holds, we can run offline simulations: randomly shuffle the treatment and control labels to generate the null distribution, then check whether this distribution is close to normal with a goodness-of-fit test (Kolmogorov-Smirnov or Anderson-Darling).
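The simulation could look roughly like the sketch below. It assumes the statistic of interest is the difference in means, and it computes the Kolmogorov-Smirnov distance by hand against a normal fitted to the simulated values, using only the standard library. All names and the data are hypothetical. Because the normal's parameters are estimated from the same sample, the distance is best read as a descriptive measure; for a formal test, `scipy.stats.kstest` or `scipy.stats.anderson` are options.

```python
import random
from statistics import NormalDist, fmean, stdev


def null_mean_differences(outcomes, n_treatment, n_shuffles=2000, seed=0):
    """Shuffle labels repeatedly and record the treatment-control mean difference."""
    rng = random.Random(seed)
    pooled = list(outcomes)
    diffs = []
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # random relabeling; outcomes themselves are unchanged
        diffs.append(fmean(pooled[:n_treatment]) - fmean(pooled[n_treatment:]))
    return diffs


def ks_distance_to_normal(sample):
    """Largest gap between the empirical CDF and a normal fitted to the sample."""
    fitted = NormalDist(fmean(sample), stdev(sample))
    ordered = sorted(sample)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        cdf = fitted.cdf(x)
        # Check the gap just after and just before the ECDF jump at x.
        distance = max(distance, (i + 1) / n - cdf, cdf - i / n)
    return distance


# Hypothetical, heavily right-skewed outcomes for 200 users.
data_rng = random.Random(42)
outcomes = [data_rng.expovariate(1.0) ** 3 for _ in range(200)]
diffs = null_mean_differences(outcomes, n_treatment=100)
print("KS distance to normal:", ks_distance_to_normal(diffs))
```

A distance near zero suggests the null distribution is close to normal; a large distance suggests the sample is too small for the skewness of the metric.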
If the normality assumption is violated, this essentially means that the skewness of the data is too high for the available sample size. Some of the things that we could do about it include:
- Cap values or transform variables (e.g., with a log transform) to reduce the skewness of the sample distribution.
- Do a permutation test and see where your observation stands relative to the simulated null distribution.
  - In a permutation test we repeatedly re-randomize observations to treatment and control (while keeping their observed outcomes), and then we build the distribution of the statistic under the null hypothesis.
  - Note that permutations are drawn without replacement (unlike the bootstrap)!
  - Once we have estimated the null distribution, we can get the p-value by locating the observed t-statistic in this null distribution, which might differ from a normal distribution.
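
The permutation test described in the last bullet can be sketched as follows. For simplicity, this version uses the difference in means as the test statistic rather than the t-statistic, and returns a two-sided p-value. The function name, defaults, and data are hypothetical.

```python
import random
from statistics import fmean


def permutation_p_value(treatment, control, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for the difference in means."""
    rng = random.Random(seed)
    observed = fmean(treatment) - fmean(control)
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling reassigns labels without replacement: every outcome is
        # used exactly once in each permutation.
        rng.shuffle(pooled)
        diff = fmean(pooled[:n_treatment]) - fmean(pooled[n_treatment:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical skewed outcomes with a few large values in each group.
treatment = [0, 0, 1, 2, 0, 40, 3, 0, 1, 55]
control = [0, 1, 0, 0, 2, 0, 30, 1, 0, 0]
print("permutation p-value:", permutation_p_value(treatment, control))
```

Because the p-value comes from the simulated null distribution itself, it does not rely on the mean being normally distributed.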