Chap2 An end-to-end example of online experiments
First define the hypothesis, second define the OEC (Overall Evaluation Criterion), third determine which users to include in the denominator of the OEC.
2.1. Hypothesis Testing: Establishing Statistical Significance
- P-value < 0.05: the result is considered statistically significant. With this threshold, if there is truly no effect, we correctly infer that there is no effect 95 out of 100 times.
- Confidence interval does not overlap with 0: an equivalent check for statistical significance. A 95% CI is the range that covers the true difference 95% of the time; for fairly large sample sizes it is usually centered on the observed delta between Treatment and Control, extending 1.96 standard errors on each side.
- Statistical power of 80-90%: power is the probability of detecting a meaningful difference between the variants when there really is one.
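The three quantities above can be sketched in a few lines. This is a minimal illustration, assuming a large-sample normal approximation (a z-test on the difference of means); the function name `compare_means` and its parameters are hypothetical and not taken from the book.

```python
from math import sqrt
from statistics import NormalDist


def compare_means(mean_t, mean_c, sd_t, sd_c, n_t, n_c, alpha=0.05):
    """Return (delta, p_value, (ci_low, ci_high)) for Treatment minus Control.

    Assumes samples are large enough for the normal approximation to hold.
    """
    delta = mean_t - mean_c
    # Standard error of the difference between two independent sample means.
    se = sqrt(sd_t ** 2 / n_t + sd_c ** 2 / n_c)
    z = delta / se
    # Two-sided p-value under the null hypothesis of no difference.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Critical value is about 1.96 when alpha = 0.05.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return delta, p_value, (delta - z_crit * se, delta + z_crit * se)


if __name__ == "__main__":
    # Made-up summary statistics, for illustration only.
    delta, p, (lo, hi) = compare_means(10.5, 10.0, 5.0, 5.0, 1000, 1000)
    print(f"delta={delta:.3f} p={p:.4f} 95% CI=({lo:.3f}, {hi:.3f})")
```

With these made-up inputs the interval excludes 0 and the p-value falls below 0.05, so the two checks agree, as they should for a two-sided test at the same alpha.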
2.2. Designing the Experiment
We will use this set of decisions to finalize the design:
- What is the randomization unit?
- What population of randomization units do we want to target?
- How large (in sample size) does our experiment need to be?
- How long do we run the experiment?
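The size and duration questions can be roughed out with the standard two-sample power formula. The sketch below assumes equal-sized variants, a two-sided test, a known standard deviation `sigma`, and a minimum effect `delta` worth detecting; the function names and the `daily_units` parameter are hypothetical. At alpha = 0.05 and 80% power the formula reduces to roughly 16·sigma²/delta² units per variant.

```python
from math import ceil
from statistics import NormalDist


def units_per_variant(sigma, delta, alpha=0.05, power=0.8):
    """Approximate units needed in EACH variant to detect a difference `delta`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    return ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


def days_to_run(n_per_variant, daily_units, variants=2):
    """Days needed if `daily_units` new randomization units arrive per day.

    Simplification: ignores repeat visitors, day-of-week effects, and novelty
    or primacy effects, all of which usually argue for running longer.
    """
    return ceil(n_per_variant * variants / daily_units)


if __name__ == "__main__":
    n = units_per_variant(sigma=10.0, delta=1.0)
    print(n, days_to_run(n, daily_units=500))
```

Because the required size grows with 1/delta², halving the effect you want to detect roughly quadruples the sample, which is why choosing the minimum meaningful effect up front matters so much.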
Example:
2.3. Running the experiment and getting data
2.4. Interpreting the results
2.5. From the results to decisions
1. Not statistically significant, and clearly not practically significant. Abandon.
2. Statistically and practically significant. Launch!
3. Statistically but not practically significant. Probably not worth launching.
4. Not statistically significant, and the CI is too wide to call the result neutral. Recommend running a follow-up test with a larger sample size.
5. Likely practically significant but not statistically significant. Repeat with greater power.
6. Statistically significant and likely practically significant. Suggest repeating with greater power. From a launch/no-launch perspective, however, choosing to launch is a reasonable decision.
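One way to make the six cases concrete is to compare the confidence interval of the delta against 0 and against a practical significance boundary. The sketch below is a simplified reading of the list above, not the book's procedure: `classify_result`, the `practical` threshold, and the extra case `0` for outcomes the list does not cover (such as a significantly negative result) are all assumptions made for illustration.

```python
def classify_result(ci_low, ci_high, practical):
    """Map a CI for (Treatment - Control) to (case_number, suggested_action).

    `practical` is the positive practical-significance boundary.
    Case 0 marks outcomes outside the six cases listed in the notes.
    """
    if practical <= 0 or ci_low > ci_high:
        raise ValueError("need practical > 0 and ci_low <= ci_high")
    if ci_high < 0:
        return 0, "significantly negative: do not launch"
    if ci_low > 0:  # CI excludes 0: statistically significant improvement
        if ci_low >= practical:
            return 2, "launch"
        if ci_high <= practical:
            return 3, "not worth launching"
        # CI straddles the practical boundary.
        return 6, "repeat with greater power; launching is reasonable"
    # From here on the CI contains 0: not statistically significant.
    if -practical < ci_low and ci_high < practical:
        return 1, "abandon"
    if ci_low <= -practical and ci_high >= practical:
        return 4, "CI too wide: rerun with a larger sample"
    if ci_high >= practical:
        return 5, "repeat with greater power"
    return 0, "possibly harmful: rerun with greater power"


if __name__ == "__main__":
    # Illustrative intervals with a practical boundary of 1.0.
    for ci in [(-0.2, 0.3), (1.5, 2.5), (0.2, 0.8), (-2.0, 2.0), (-0.3, 2.5), (0.5, 2.5)]:
        print(ci, classify_result(*ci, practical=1.0))
```

Treating the boundary as a hard cutoff is a deliberate simplification; in practice the launch decision also weighs the cost of building and maintaining the change and the cost of a wrong call.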