
A/B Testing and Correcting for Multiple Comparisons


In this blog, Metacore’s Data Analyst Joonas Kekki discusses challenges in A/B testing and provides his five (well, actually, more like a hundred) cents on addressing them, focusing on correcting for multiple comparisons from a frequentist perspective, which assesses probability based on the long-run frequency of outcomes rather than subjective beliefs.
Check also his recent blog on common pitfalls in visualizing A/B testing results.
Before we get to the beef, let’s align on some key concepts discussed in the blog, so we are on the same page.
Null Hypothesis Significance Testing (NHST)
A method for determining if there is enough evidence to reject the null hypothesis, which often assumes that a new feature has no effect on player behavior, such as engagement or revenue.
P-value
A measure of evidence against the null hypothesis. It reflects the probability of observing results at least as extreme as those seen, assuming the null hypothesis is true. In simple terms, it shows how likely results this extreme are under the initial assumption.
Type I Error Rate
The probability of incorrectly rejecting the null hypothesis, or concluding that a feature affects player behavior when it doesn’t (false positive). Typically set at 5%, meaning p-values of 5% or smaller are needed to reject the null hypothesis.
Family-wise Error Rate
The probability of making at least one Type I error across a set of related hypotheses (the joint hypothesis). For example, when multiple metrics like retention and conversion are tested, this rate reflects the chance of a false positive across any of them.
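To see how quickly this adds up, here's a back-of-the-envelope sketch (assuming independent tests, each run at a 5% alpha):

```python
# Family-wise error rate for m independent tests, each run at alpha = 0.05.
# If every null hypothesis is true, the chance of at least one false positive
# is 1 - (1 - alpha)^m.
alpha = 0.05
for m in (1, 2, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:>2} tests -> chance of at least one false positive: {fwer:.1%}")
```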
Aaand, there’s more – a few taxonomy disclaimers
In order to discuss multiple testing, we need to know what we mean by that. Unfortunately, nobody seems to agree on what it means.
As professor Miguel García-Pérez (2023) put it: “Despite the ubiquity of the term, no clear definition seems to be available in reference sources as to what practices represent the type of multiple testing that demands control of Type-I error rates”.
We need something in order to proceed, though, so I’m going to use Mark Rubin’s taxonomy from his paper “When to Adjust Alpha During Multiple Testing”.
According to Rubin, there are three cases to consider.
- Any significant result leads to rejecting the joint null hypothesis. For example, having two metrics – say, retention and conversion – and rejecting the joint hypothesis if either one is significant. He calls this disjunction testing, but we’ll call it “hey, at least one is significant” testing.
- All results need to be significant before the joint null hypothesis is rejected. Continuing the previous example: both retention and conversion need to be significant in order to reject the hypothesis. Let’s call this conjunction testing or “all-significant” testing.
- No claims are made about the joint null hypothesis, and each hypothesis is considered separately. For example, retention being significant is independent of whether conversion is significant. This is called individual testing.
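To make the three cases concrete, here's a tiny sketch with made-up p-values for retention and conversion, showing how each decision rule plays out:

```python
# Toy decision rules for the three cases, using made-up p-values for two
# metrics and a conventional alpha of 5%.
alpha = 0.05
p_values = {"retention": 0.03, "conversion": 0.20}  # hypothetical numbers

# "Hey, at least one is significant" (disjunction) testing:
# reject the joint null if ANY metric clears the threshold.
# This is the case that inflates Type I error and calls for correction.
reject_disjunction = any(p < alpha for p in p_values.values())

# "All-significant" (conjunction) testing:
# reject the joint null only if EVERY metric clears the threshold.
reject_conjunction = all(p < alpha for p in p_values.values())

# Individual testing: each metric gets its own verdict, no joint claim.
individual_verdicts = {metric: p < alpha for metric, p in p_values.items()}

print(reject_disjunction, reject_conjunction, individual_verdicts)
```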
Now, let’s examine four common scenarios to see if we can get some clarity on when family-wise correction is warranted and when it’s detrimental to our case.

Whether or not you should use family-wise correction depends on your specific goals
Checking trends over time and calculating p-values or confidence intervals for each time step leads to multiple hypotheses. Should we use family-wise correction to avoid an inflated Type I error rate?
It depends on what you’re going to do with the information. If you choose to reject the joint null hypothesis whenever at least one time step shows a significant result, you need to correct for the family-wise error. In practice, though, this amounts to cherry-picking the most extreme value, which also inflates the effect estimate, and that is something family-wise correction can’t fix.
Rejecting the joint null hypothesis only if all time steps have significant results avoids concerns about inflated Type I error, as there's only one chance to reject. However, this approach risks insufficient power, making it harder to reject the joint null hypothesis. For instance, with two time periods, power drops to 64%, and with more periods, it declines even further.
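Where does the 64% come from? Assuming each individual test has 80% power and the tests are independent, the chance that all of them come out significant shrinks multiplicatively:

```python
# Conjunction ("all-significant") power with independent tests,
# assuming each individual test has 80% power.
power_single = 0.80
for k in (1, 2, 3, 5):
    joint_power = power_single ** k  # chance that every test comes out significant
    print(f"{k} time steps -> joint power: {joint_power:.0%}")
```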
Treating time steps independently removes the need for family-wise error correction, as we're not estimating a joint hypothesis. This allows for concluding significance at one time step but not another, which makes sense, as we often don't know the trend's shape, and differences can vary over time.
For example, in my previous blog post, we discussed how fast seasonal content should unlock for new players in Merge Mansion. The effect on retaining players was positive and statistically significant in the very first days, but soon it turned negative, again being statistically significant. We analyzed the time steps independently, as it would be silly to green-light the change based on one positive day. It would be equally silly to correct for multiple hypotheses, as if day 2 and day 14 had to have the same result. The best course of action is to acknowledge that this change was non-linear as a function of time and to draw lessons from it for the next design.

Evaluating experiment results using multiple metrics
What about using multiple metrics to judge the results of an experiment?
"At-least-one-significant" testing: This approach is useful when any significant result in any metric is sufficient. However, it's important to be cautious, as some metrics may be prioritized over others. If chosen, it's necessary to correct for the number of metrics.
"All-significant" testing could also be a sound approach. Perhaps we're altering something fundamental in the product and want to ensure it's truly worth it, even at the risk of a false negative result. In this case, no need for correction!
Individual testing: My preferred approach, as it's simpler to analyze metrics separately. For instance, increased engagement may boost monetization, but this assumption requires careful consideration.
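If you do go the "at-least-one-significant" route and need to correct for the number of metrics, one standard option is the Holm-Bonferroni procedure. Here's a minimal sketch with hypothetical p-values for three metrics:

```python
# Holm-Bonferroni correction across metrics: a standard way to keep the
# family-wise error rate at 5% in "at-least-one-significant" testing.
# The p-values below are hypothetical.
alpha = 0.05
p_values = {"retention": 0.012, "conversion": 0.030, "session_length": 0.200}

# Compare p-values from smallest to largest against thresholds
# alpha/m, alpha/(m-1), ..., alpha, and stop at the first failure.
ordered = sorted(p_values.items(), key=lambda item: item[1])
m = len(ordered)
still_rejecting = True
decisions = {}
for i, (metric, p) in enumerate(ordered):
    still_rejecting = still_rejecting and (p <= alpha / (m - i))
    decisions[metric] = still_rejecting

print(decisions)  # {'retention': True, 'conversion': False, 'session_length': False}
```

Holm is a drop-in replacement for plain Bonferroni and never less powerful, which is why it's often the default choice when a correction is wanted at all.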
In experiments, we aim to improve key metrics while keeping others constant. Some metrics serve as guardrails to protect against negative outcomes, while key metrics measure success. Ideally, all key metrics improve without affecting guardrails, but real-life decisions often involve trade-offs.
Unfortunately, statistical corrections don't help weigh these decisions, requiring judgment beyond data.
Aside from business and behavioral outcomes, pre-test metrics control for baselines, intermediate metrics help validate assumptions, and QA metrics ensure variables behave as expected.
Remember – not all metrics are created equal. It's crucial to approach them thoughtfully.
Multiple experiments within the same interval
Now that we’ve got the engine running, what about running multiple experiments within the same time interval or on the same players?
There should be no need for family-wise correction – unless you truly are testing a joint null hypothesis! For example, if you opt to repeat the same test until you achieve significant results.
Running simultaneous experiments on the same hypothesis isn’t really practical anyway: if two tests modify the same parameter or UI component, they can’t run at the same time. Concerns such as server bandwidth limitations or other implementation issues may arise, but family-wise corrections should not be among them.
However, the overlap from simultaneous experiments can create a situation where we care about the interactions between the different groups. The hypotheses for each interaction would naturally differ: one interaction might be due to a red color, another due to a combination of starting balance and lower cost enabling a purchase. It’s hard to argue we’re testing a joint hypothesis here, so I wouldn’t correct for family-wise errors.
While it's true that with more tests, the chance of at least one Type I error increases, this is irrelevant for unrelated tests. Adjusting would be like imposing stricter alpha levels on recent experiments as experiments accumulate over time, which isn't reasonable.
Navigating multiple tests and family-wise errors
It’s easy to argue that if you're testing four groups on different values of the same parameter and declare success whenever any of them reaches statistical significance, you're engaging in “hey, at least one’s significant” testing without individual hypotheses and should correct for family-wise error.
Demanding that all treatment groups show significant results is theoretically possible if we want to be absolutely certain, but I’ve never encountered this in practice.
On the other hand, I could be convinced that each combination has its own unique hypothesis, in which case we should treat them as separate tests. For example, one group might prevent selling important items, while another gives a warning before selling. Are we testing a joint hypothesis that selling items is harmful? Or are we testing separate hypotheses about how hard it is to discard important items?
Even when varying a single parameter, like starting currency, I might disregard family-wise error correction. If we randomize starting currency amounts for players, should we adjust the alpha level based on the number of players? Or should we treat the data as evidence for a single group and choose the optimal value after modeling, using a conventional alpha level?
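To illustrate that last option, here's a rough sketch with entirely synthetic data: treat the randomized starting currency as a continuous dose, fit a simple curve, and pick the value that looks best, rather than comparing dozens of discrete groups at adjusted alpha levels.

```python
# A sketch of the "treat it as one continuous dose" option.
# The data below is entirely synthetic and for illustration only.
import numpy as np

rng = np.random.default_rng(seed=0)
starting_currency = rng.uniform(0, 1000, size=5_000)
# Hypothetical response: retention peaks somewhere in the middle, plus noise.
retention = (
    0.3 + 4e-4 * starting_currency - 4e-7 * starting_currency**2
    + rng.normal(0, 0.05, size=5_000)
)

# Quadratic fit; the vertex gives the estimated optimal starting currency (~500 here).
c2, c1, c0 = np.polyfit(starting_currency, retention, deg=2)
print(f"estimated optimal starting currency: {-c1 / (2 * c2):.0f}")
```

The point isn't the quadratic fit itself, but that a single model of the dose-response replaces a pile of pairwise tests, and with it the whole question of adjusting alpha.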
Ultimately, it depends on how you treat p-values. If p-values are the main evidence for decisions, correcting for family-wise errors makes sense. But if they're just one factor among metrics, behavior trends, causal explanations, and player feedback, then strict correction is less crucial. In such cases, I prefer to leave confidence intervals as they are and use p-values as one measure of uncertainty.
A nuanced approach is the best approach
The end is near. Are we any wiser?
Perhaps. Having a framework for thinking about joint null hypotheses is useful when dealing with multiple comparisons.
Deep down, the whole NHST business seems somewhat arbitrary: demanding more or less evidence based on criteria that aren’t clear doesn’t seem like a productive path to truth. I personally prefer to use the same confidence interval width for all comparisons and rely on my intuition for what looks too good to be true. If a second, stricter threshold is needed, it can be plotted alongside the usual one. This logic applies not only to multiple-comparison corrections but also to predetermined bounds that differ from the convention.
That said, I do calculate the per-family error rate, α × k, to get a sense of what’s expected due to randomness, even at the risk of falling into an ecological fallacy. For example, with 20 tests and an alpha of 5%, we’d expect one group to be significant just by chance. Observing exactly the expected number of significant results doesn’t lead to dismissing them as flawed, but it does prompt extra caution and a careful examination of other evidence.
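For the curious, the arithmetic behind that expectation looks like this (again assuming independent tests):

```python
# The "per-family error rate" is just alpha * k: the expected number of
# false positives across k tests when every null hypothesis is true.
# The binomial distribution gives the chance of any particular count.
from math import comb

alpha, k = 0.05, 20
expected = alpha * k  # 1.0 false positive expected purely by chance
p_exactly_one = comb(k, 1) * alpha**1 * (1 - alpha) ** (k - 1)  # about 0.38

print(f"expected false positives: {expected:.1f}")
print(f"chance of exactly one significant result by chance: {p_exactly_one:.2f}")
```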
To conclude, you could say that I'm not a firm believer in NHST and prefer to look at the evidence at large. Even for a lover of dichotomies, thinking around this subject can remain hazy, given the many contradictory claims about it.
This time, let’s conclude with the words of García-Pérez:
“In most cases, the research question doesn’t require testing whether all means are equal. Usually, there's a suspicion that some specific means differ, so the focus should be on pairwise comparisons. In such cases, it's better to skip the ANOVA and directly test the comparisons at α, without adjusting for multiple tests, since the same hypothesis isn’t being tested repeatedly.”
Joonas Kekki is a Data Analyst at Metacore.
