SRM can occur for several reasons, including:
Page redirects in the treatment (e.g., the treatment is delivered through web page redirects that take significantly longer, so some users may drop off before they are logged)
Bad hash randomization (more generally, buggy randomization code)
Trigger conditions that are themselves influenced by the experiment (more generally, bad trigger conditions can lead to imbalance)
Data pipeline logging and filtering, e.g., removing users who are inactive or deemed bots
Differences in the time of day at which treatment and control are exposed, which can also bias metric measurement
We can test for SRM with a chi-squared goodness-of-fit test, where the null hypothesis is that the observed sample ratio matches the intended one (here, a ratio of 1). Suppose, for instance, that we intended to split 1000 users evenly but the actual groups contain 550 and 450 users. We can compute the \(\chi^2\) statistic as follows:
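\[ \chi^2 = \sum_i \frac{(O_i - N p_i)^2}{N p_i} = \frac{(550-500)^2}{500} + \frac{(450-500)^2}{500} = 5 + 5 = 10 \]
Here \(O_i\) is the observed count in group \(i\), \(N\) is the total number of users, and \(p_i\) is the intended share of group \(i\). With two groups the statistic has one degree of freedom, for which the 5% critical value is about 3.84. A value of 10 is therefore too large to be plausible under the null, and we reject the null.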
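from scipy.stats import chisquare
# Observed group sizes and the counts expected under an even split of 1000 users
observed = [550, 450]
expected = [500, 500]
# Chi-squared goodness-of-fit test; the result holds the statistic and the p-value
chi = chisquare(observed, f_exp=expected)
print(f"chi squared statistic: {chi[0]} \np-val: {chi[1]:.3f}")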
## chi squared statistic: 10.0
## p-val: 0.002
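The same check can be wrapped in a small helper that also handles uneven intended splits, such as 90/10. This is only a minimal sketch: the function name `srm_check`, its `ratios` parameter, and the default threshold `alpha=0.05` are assumptions made for illustration, not part of any library, and the threshold should be chosen to suit your own experimentation setup.

```python
from scipy.stats import chisquare


def srm_check(observed, ratios, alpha=0.05):
    """Chi-squared goodness-of-fit test for sample ratio mismatch.

    observed: user counts actually seen in each group
    ratios:   intended allocation weights (hypothetical parameter),
              e.g. [1, 1] for an even split or [9, 1] for 90/10
    alpha:    significance threshold (assumed default, adjust as needed)
    """
    total = sum(observed)
    weight_sum = sum(ratios)
    # Expected count per group is N * p_i, with p_i the intended share
    expected = [total * r / weight_sum for r in ratios]
    stat, p_value = chisquare(observed, f_exp=expected)
    # Flag a mismatch when the p-value falls below the threshold
    return float(stat), float(p_value), bool(p_value < alpha)


# The 550 / 450 example from the text, with an intended even split
stat, p_value, mismatch = srm_check([550, 450], [1, 1])
print(f"statistic={stat:.1f}, p-value={p_value:.4f}, SRM flagged={mismatch}")
```

Passing weights rather than expected counts keeps the intended design explicit, so the expected counts always sum to the observed total, which `chisquare` requires.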