A/B Testing: How (not) to Visualize A/B Tests
To create entertaining games, we at Metacore need to understand our players’ desires, behavior and characteristics. Our game development philosophy rests on listening to player feedback – and making decisions based on data.
A/B testing, also known as split testing or a randomized controlled trial, is a useful tool for comparing versions of something to determine the most data-supported way forward. In 2023, Merge Mansion ran close to 50 experiments – roughly one per week – ranging from tutorial flow improvements to trying out new features.
Experiments enable rapid iterations and optimization – if you can interpret the results correctly. Joonas Kekki, Metacore’s Data Analyst, walks us through common pitfalls in visualizing A/B testing results, providing better alternatives. Read on to avoid errors and get the most out of your results!
Looking at the results the right way
If you’ve ever analyzed A/B tests, you’ve probably seen something along these lines:
Perhaps it had more metrics, maybe a sparkline for the trend – or just the KPIs without much indication of uncertainty. Can you tell what’s wrong with the results visualization above? No? Alright.
It blends time periods together, gives no indication of the trend, and offers zero context as to whether you’ve picked an outlier period. And that’s just for starters. Because of these subtle sins, few people will question the presentation of the results or the underlying data; instead, they focus on discussing p-values or effect sizes under the assumption that the estimates have meaningful interpretations.
Why? Because the simplicity of the presentation makes it seem trustworthy. This misplaced trust then leads to decisions being made based on non-representative metrics.
Spoiler – that’s not great. Why? Imagine there’s a trend in the experiment where one group performs really well in the beginning but starts to lag behind after a while. For example, we tested introducing seasonal content – such as Halloween-themed events – to new players on their very first day. All metrics were initially superb, since the fresh mechanics and extra content gave players more opportunities to engage and spend.
However, it quickly became apparent that the experience was overwhelming for players who were only learning the basics of the game. Below, you see a real-world example of this:
If we disregard time and average over the days, our estimate might be positive, near zero, or negative depending on the sample sizes, how much time has passed, and so on – but it won’t be anywhere near the real value.
Averaging over the experiment period works only if the difference between control and variant is constant, in which case the resulting number has a meaningful interpretation. However, we should not practice bad habits, even if they work in specific cases. We are in a particularly tough spot if the trend flips sign and subsequent cohorts are larger than the initial ones: the blended average could carry the wrong sign for a long while.
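Here is a minimal simulation of that situation, with effect sizes and cohort sizes invented purely for illustration: when the variant’s effect flips from positive to negative after a few cohort days and later cohorts are larger, the blended calendar average ends up with the wrong sign.

```python
import numpy as np

# Hypothetical numbers: the variant helps during a player's first few days
# (novelty, extra content) but hurts once the early learning phase is disrupted.
def effect(cohort_day):
    return 0.5 if cohort_day < 3 else -0.3   # variant minus control, per player-day

n_days = 10
cohort_sizes = np.linspace(100, 1000, n_days)   # later cohorts are larger, e.g. marketing is ramping up

weighted_sum, player_days = 0.0, 0.0
for join_day, size in enumerate(cohort_sizes):
    for calendar_day in range(join_day, n_days):
        cohort_day = calendar_day - join_day
        weighted_sum += size * effect(cohort_day)
        player_days += size

blended = weighted_sum / player_days                        # one number over the whole calendar window
cohort_view = np.mean([effect(d) for d in range(n_days)])   # average across cohort days

print(f"Blended calendar average: {blended:+.3f}")    # about +0.19, looks like a win
print(f"Average over cohort days: {cohort_view:+.3f}")  # -0.06, the variant actually loses over 10 days
```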
The importance of context
Another problem with blended or calendar-based metrics is that it’s not easy to tell whether you’ve accidentally picked an outlier time period. Why? There’s no context for the numbers!
Imagine someone buying one lottery ticket and winning without knowing the odds – you can’t tell whether winning is typical or not. If they bought ten tickets and won only once, it would already be clearer that winning is not the most representative outcome of playing the lottery. Against all odds, we sometimes observe unreasonably unlikely outcomes that look significant in terms of both effect size and statistical evidence. If there’s no context other than one data point, all we can do is believe it – congratulations on winning the lottery, buddy!
On the other hand, a quick look at a time series shows that this is not a typical day.
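As a toy illustration of why the surrounding series matters, here is a sketch with synthetic numbers: a metric whose true difference is zero still produces individual days that look impressive, and only the rest of the series reveals how atypical they are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily differences (variant - control) over a 60-day test:
# the true effect is zero, but day-to-day noise still produces extreme days.
daily_diff = rng.normal(loc=0.0, scale=1.0, size=60)

reported = daily_diff.argmax()                     # the single "lottery win" day someone reports
rank = (daily_diff < daily_diff[reported]).mean()  # how it compares to the rest of the series

print(f"Reported day: {daily_diff[reported]:+.2f}")
print(f"Median day:   {np.median(daily_diff):+.2f}")
print(f"The reported day beats {rank:.0%} of all days in the test")
```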
Seeing the trend is useful not only in deciding what is a representative time period, but also in evaluating if the experiment should go on. It would be hard to argue that the trend in the first example will reverse, given enough time.
So, it’s better to cut the losses and move on to the next iteration, even if your plan was to wait and see what happens. Or perhaps, after the initial effect, there’s no change and the experiment can be cut short. Or maybe the groups converge and it becomes hard to say who ends up winning.

A quick clarification: when we talk about days, we don’t mean calendar days but days measured since the participant joined the experiment. These dates will, of course, differ for each test subject, depending on when they joined.

A good rule of thumb is to use cohort metrics instead of calendar or blended metrics. But as we all know, every rule has its exceptions, so here we go: sales or other calendar-based events that occur only on a certain date might be one – yet even then, it’s a great idea to evaluate cohort-based outcomes with cohort-based metrics. The visualization below isn’t a great way to look at the trend. Simple-ish as that, right?
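To make the cohort-day bookkeeping concrete, here is a minimal pandas sketch of computing a cohort metric. The table layout and column names (join_date, revenue, and so on) are made up for illustration, not an actual schema.

```python
import pandas as pd

# Hypothetical event-level data: one row per player per calendar day.
events = pd.DataFrame({
    "player_id":     [1, 1, 1, 2, 2, 3],
    "group":         ["control", "control", "control", "variant", "variant", "variant"],
    "join_date":     pd.to_datetime(["2023-10-01"] * 3 + ["2023-10-03"] * 2 + ["2023-10-05"]),
    "calendar_date": pd.to_datetime(["2023-10-01", "2023-10-02", "2023-10-03",
                                     "2023-10-03", "2023-10-04", "2023-10-05"]),
    "revenue":       [0.0, 1.2, 0.5, 2.0, 0.0, 0.3],
})

# Cohort day: days since the player joined the experiment, not the calendar date.
events["cohort_day"] = (events["calendar_date"] - events["join_date"]).dt.days

# Cohort metric: average KPI per group for each cohort day.
cohort_metric = (
    events.groupby(["cohort_day", "group"])["revenue"].mean().unstack("group")
)
print(cohort_metric)
```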
One more thing for the road
Data becomes information when acted upon, and after all the sweat and salty tears poured into the analysis, one is tempted to give sound design recommendations based on the test results. Just remember that group averages and differences in means are statistical devices.
Depending on the distribution, they might not describe a typical individual. If, for example, Elon Musk jumps into a metro car, the average wealth in that car becomes significantly higher than in the car behind it – yet that average hardly reflects a typical passenger, even though the group difference is real.
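A quick numeric sketch of the same point, with figures invented for illustration: the difference in means between the two cars is real, yet the mean says almost nothing about a typical passenger, while the median does.

```python
import numpy as np

# Hypothetical wealth (in euros) of passengers in two metro cars.
car_a = np.array([30_000, 45_000, 20_000, 60_000, 250_000_000_000])  # one extreme outlier boards car A
car_b = np.array([35_000, 40_000, 25_000, 55_000, 50_000])

print(f"Mean   A vs B: {car_a.mean():,.0f} vs {car_b.mean():,.0f}")
print(f"Median A vs B: {np.median(car_a):,.0f} vs {np.median(car_b):,.0f}")
# The difference in means is real, but the mean of car A describes no actual passenger.
```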
Stay tuned for more A/B testing insights from Metacore's Data Analyst Joonas Kekki in the next blog of this series.