
How do you test whether the normality assumption holds? What are some things that you can do if the normality assumption is violated?

First, it is important to highlight a common misconception: the normality assumption refers **to the mean** of a metric of interest \(Y\) **and not** to the distribution of \(Y\) itself. In fact, in most cases the metric of interest \(Y\) will not follow a normal distribution; however, its sample average \(\bar{Y}\) will be approximately normal for large enough samples, by the Central Limit Theorem.

Why is normality important? Because we **implicitly assume normality** when we compute t-statistics and p-values. If, however, the mean of our variable of interest does not follow a normal distribution, then the estimated p-values are not accurate.

The main reason for violating the normality assumption is a smaller-than-needed sample size. In fact, **as the sample size increases, the distribution of the mean \(\bar{Y}\) becomes more normally distributed**.

A rule of thumb for the minimum sample size for \(\bar{Y}\) to be approximately normal is **\(355s^2\)**, where \(s\) is the skewness coefficient of the sample distribution, defined as \(s = \frac{E[(Y - E(Y))^3]}{[Var(Y)]^{3/2}}\) (Kohavi, Tang, and Xu (2020), p. 187).
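As a rough illustration of this rule of thumb, the sketch below estimates the skewness coefficient from a sample and plugs it into \(355s^2\). The helper names `skewness` and `min_sample_size` and the simulated exponential `revenue` metric are assumptions made for this example; they do not come from the reference.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5


def min_sample_size(values):
    """Rule-of-thumb minimum sample size, 355 * s^2, rounded up."""
    s = skewness(values)
    return math.ceil(355 * s ** 2)


# Hypothetical skewed metric: exponential draws (theoretical skewness of 2).
rng = random.Random(0)
revenue = [rng.expovariate(1.0) for _ in range(50_000)]

s = skewness(revenue)
n_min = min_sample_size(revenue)
print(f"skewness = {s:.2f}, rule-of-thumb minimum sample size = {n_min}")
```

Because the required size grows with the square of the skewness, halving the skewness cuts the required sample size by a factor of four.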

To **test** if the normality assumption holds, we can run offline simulations: randomly shuffle the treatment and control labels many times to generate the null distribution, then check whether this distribution is close to normal using a goodness-of-fit test (Kolmogorov-Smirnov or Anderson-Darling).
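The offline simulation could be sketched as follows, using only the standard library. The function names, the lognormal `outcomes`, and the group sizes are assumptions for illustration. The code only computes the Kolmogorov-Smirnov distance to a fitted normal; because the normal's parameters are estimated from the same sample, standard KS critical values do not apply directly, so treat the distance as a diagnostic rather than a formal test. In practice, library routines such as `scipy.stats.kstest` or `scipy.stats.anderson` could be used instead.

```python
import random
from statistics import NormalDist, fmean, pstdev


def null_differences(outcomes, n_treatment, n_shuffles, rng):
    """Shuffle group labels repeatedly and record the difference in means."""
    pooled = list(outcomes)
    diffs = []
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # re-assign observations to groups at random
        treat, ctrl = pooled[:n_treatment], pooled[n_treatment:]
        diffs.append(fmean(treat) - fmean(ctrl))
    return diffs


def ks_distance_to_normal(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    fitted = NormalDist(fmean(sample), pstdev(sample))
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        cdf = fitted.cdf(x)
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Hypothetical skewed metric with 200 users per group.
rng = random.Random(42)
outcomes = [rng.lognormvariate(0.0, 1.5) for _ in range(400)]

diffs = null_differences(outcomes, n_treatment=200, n_shuffles=1000, rng=rng)
d_stat = ks_distance_to_normal(diffs)
print(f"KS distance of the simulated null to a normal: {d_stat:.3f}")
```

A smaller distance means the simulated null distribution of the difference in means is closer to normal; rerunning with a larger sample per group should typically shrink it.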

If the normality assumption is violated, this essentially means that the skewness of the data is too high, so some of the things that we could do to reduce it include:

- Cap values or transform variables (e.g., log transform) to artificially reduce the skewness of the sample distribution.
- Do a permutation test and see where your observation stands relative to the simulated null distribution:
  - In a permutation test, we repeatedly re-randomize observations to treatment and control (while keeping their observed outcomes) and then build the distribution of the test statistic under the null hypothesis.
  - Note that permutations are drawn **without replacement**!
  - Once we have estimated the null distribution, **we can get the p-value by locating the observed t-statistic in this null distribution**, which might differ from a normal!
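A minimal sketch of such a permutation test is shown below. It assumes a Welch-style t-statistic and a two-sided p-value with the common "+1" correction; the function names and the simulated lognormal data are hypothetical choices for this example.

```python
import random


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, var


def t_statistic(treat, ctrl):
    """Welch-style t-statistic for the difference in means."""
    m_t, v_t = mean_and_variance(treat)
    m_c, v_c = mean_and_variance(ctrl)
    return (m_t - m_c) / (v_t / len(treat) + v_c / len(ctrl)) ** 0.5


def permutation_p_value(treat, ctrl, n_permutations=2000, seed=0):
    """Two-sided p-value from the permutation null of the t-statistic."""
    rng = random.Random(seed)
    observed = t_statistic(treat, ctrl)
    pooled = list(treat) + list(ctrl)
    n_treat = len(treat)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling reassigns labels without replacement: every observation
        # keeps its outcome and lands in exactly one group.
        rng.shuffle(pooled)
        t_perm = t_statistic(pooled[:n_treat], pooled[n_treat:])
        if abs(t_perm) >= abs(observed):
            extreme += 1
    # The +1 counts the observed assignment as one of the permutations.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical skewed data with a true treatment effect of +1.0.
rng = random.Random(7)
control = [rng.lognormvariate(0.0, 1.0) for _ in range(300)]
treatment = [rng.lognormvariate(0.0, 1.0) + 1.0 for _ in range(300)]

t_obs, p_value = permutation_p_value(treatment, control)
print(f"observed t = {t_obs:.2f}, permutation p-value = {p_value:.4f}")
```

The p-value here is read off the simulated null distribution instead of a normal or t table, so it does not rely on the normality of the mean.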
