Why two-sided testing is reducing your A/B testing program’s impact by 25%

Dominic Sando
9 min read · Jul 1, 2020


The one-sided vs two-sided debate has been one of the most fiercely fought of the 20th century. The feud began in the fifties when psychologists Marks (1951) and Jones (1952) squared off with Burke (1953) and Hick (1952). Jones believed one-sided was the only methodology required. Burke said it could lead to no real learnings and that we must instead always use two-sided tests. I like to think they settled their dispute in a back alley outside a remote local university bar. But on paper neither side ever conceded.

Jones’ and Burke’s vigorous debate. (Image taken from SNL).

These debates have continued unresolved into the 21st century. Recent proponents of one-sided tests include Georgi Georgiev, Daniel Lakens, Krueger (2011), and Cho & Abe (2013). But these proponents face a tide of detractors of one-sided tests. To name just a few across the academic, optimisation-agency, and blogging worlds: Lombardi & Hulbert (2009), Invespcro, and Jim Frost.

The reason these debates have been unresolved for decades is that the answer to which test you should use is a firm “it depends”. There is no universal truth.

Academics may lean towards using two-sided tests, while online experimenters may lean heavily towards using one-sided. Both will have excellent reasons for doing so. What is important is aligning the inferences you’d like to make with the test best suited to support or invalidate those inferences. You must decide whether to use a one-sided or a two-sided test given the goals of your research or experimentation program.

And when it comes to online experimentation programs, in 99% of cases, it is better to use one-sided tests.

One-sided tests have clear efficiency benefits, requiring 20%-30% less data than their two-sided counterparts to detect a given positive MDE. And even though one-sided tests cannot ‘document’ significant negative learnings, the occasions where doing so is valuable are rarer than you think.

So if you’re using two-sided tests regularly, you are likely producing 25% less value over the long term due to reduced testing speed! Read on to find out why.

Note: if you aren’t familiar with the differences between one- and two-sided tests, this article provides a quick summary.

Identifying significant negative results isn’t often that valuable.

In A/B testing, we are always trying to create a positive effect. Consider two hypotheses by a Growth experimenter:

  • The classic hypothesis: By changing variable X, we will increase CVR / Retention or decrease Churn / Pausing.
  • The mad hypothesis: By changing variable X, we will either increase or decrease CVR / Retention / Churn / Pausing.

I have never seen a hypothesis which looks like the latter. We are not in the business of creating changes that merely do something to customer experience. We want to create changes that will improve customer experience. Because of this goal, our statistical tests only need to provide support for whether or not a new variant is likely to improve a metric in a given direction.

Thus, the universally accepted decision-making rationale for using one-sided tests is satisfied: in nearly all cases, if the test performs either the same as or worse than the control, we wouldn’t roll it out. So in terms of decision-making criteria, we are only looking to prove μ₂ > μ₁: the one-sided alternative hypothesis. If μ₂ < μ₁ or μ₂ = μ₁, we wouldn’t roll out the variant.
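To make that decision rule concrete, here is a minimal sketch of a one-sided two-proportion z-test in Python. It assumes scipy is available, and the visitor and conversion counts are invented purely for illustration; this is not a prescription for how your platform should compute it.

```python
# A minimal sketch of the decision rule above: a one-sided two-proportion
# z-test where H1 is "the variant's CVR is higher than the control's".
# The counts below are made up purely for illustration.
from scipy.stats import norm

def one_sided_cvr_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return the z statistic, the one-sided p-value, and whether to reject H0."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled CVR under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = norm.sf(z)                                   # upper tail only
    return z, p_value, p_value < alpha

# Control: 1,000 visitors, 100 conversions; variant: 1,000 visitors, 125 conversions.
z, p, roll_out = one_sided_cvr_test(100, 1000, 125, 1000)
print(f"z = {z:.2f}, one-sided p = {p:.3f}, roll out: {roll_out}")
```

Because only the upper tail matters, a flat or negative result simply fails to clear the threshold, which maps directly onto the “don’t roll it out” decision.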

“But wait!” The purists cry. “You’ve missed one last important factor!”

“What about if we miss a significant result in the opposite direction? Might that impact our future learnings?”

I do not believe it will, for two important reasons.

1. Significant negative results rarely give learnings that you can build upon in future testing

How often can significant negative results give us concrete learnings we can build on in the future? For nearly all of the tests we run, they cannot. In each of the following hypothetical examples, there are no clear follow-up tests when we observe a significant negative result.

  1. We add new education about our product to our very simple signup flow
  2. We introduce a new Abandoned Cart email journey
  3. We completely redesign our homepage and test it against the old one

Tests 1. and 2. are in the testing category of adding an entirely new feature or experience to a previously empty journey. If that feature creates a negative result, we would just not roll it out. There is nothing more we can remove.

Let’s say we need to know the breed of dog to make our product properly, so this step in the signup flow is necessary. Our recent test was to add the paragraph at the top to explain why customers are required to put in this information. If the test ended up reducing Signup CVR, there’s no more information we can take away. (Taken from the Tails.com signup flow.)

Test 3. is in the testing category I like to call “too many variables, what the *** made the difference?” If a new homepage doesn’t work, was it the new colour scheme that caused it? The prominence of the search bar? Or was it the confusing new layout? How can we conclude what affected CVR when there were so many interacting variables?

If Airbnb’s old page performed better than the new page, what exactly could we infer? Do we reason that people love nostalgic 2005-era web design? Taken from their excellent blog here.

There are maybe one or two rare exceptions where you may want to use two-sided tests.

  1. The costs or barriers to testing in the negative direction are high. Let’s say you think the latest Brand refresh might hurt, not improve, conversions on your homepage. You might need to produce stringent (i.e. statistically significant) proof that the new design doesn’t work if you want to test removing it elsewhere on your website, as you’ll be upsetting a lot of people by doing so. (Example purely illustrative; our Brand team is amazing.)
  2. You are experiencing severe pressure from HiPPOs to validate a particular feature. Demonstrating that what you’re testing is not only failing to improve things but is significantly harming a key success metric might help convince stakeholders to leave you alone.

2. In business, nobody needs you to “significantly prove” your negative result.

In academia, the community might find your highly significant negative result fascinating and a source of ground-breaking new research. They often value any significant effect, negative or otherwise. Tens or hundreds of academic papers typically validate broader theories by demonstrating different types of outcomes under different test conditions.

On the other hand, when you bring your CEO your significantly negative test result, hoping for recognition, I’m sure she’ll take it as proof your ideas suck.

Using significance thresholds, academics “bank” learnings, while online experimenters “bank” results.

As we’ve discussed above, when you roll out a change to a website or app, it’s crucial to be sure that you’re improving the customer experience. So meeting statistical significance thresholds is vital in this respect.

When a test is positively significant, you roll it out and “bank a result”. By surpassing your defined significance threshold, you’ve minimised the risk that you haven’t changed anything or have made things worse. (Let’s not get too sucked into p-value definitions at this time.)

But if a test is negatively significant, making a critical success metric worse, you don’t roll it out. You instead use it to inform future experiments. And in these cases, are pre-defined ‘significance thresholds’ really that necessary for driving next steps? We can still use negative performances of variants to inform future testing, whether or not they are identified as significant. Running another experiment is considerably less risky than rolling out a permanent change, so our tolerance for a false-positive negative result can be considerably higher.

Academics often run one- or two-sided tests on others’ published papers to see whether doing so would change the conclusions. In a similar vein, when we see a sizeable negative result after running a one-sided test, we run a one-sided test the other way to get a sense of how significant the effect would have been if we had been looking in that direction. While we cannot conclude a statistically valid result (due to inflated false-positive rates), we can use this information to inform further testing.

With the dummy data above, we cannot reject the null, and therefore cannot say with statistical validity that the test performed worse than the control. But we can infer that whatever we tested likely reduced CVR.
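To make the “look the other way” check concrete, here is a minimal sketch with invented conversion counts (not the dummy data in the chart above), again assuming scipy is available:

```python
# A rough sketch of the "look the other way" check described above, using
# made-up numbers. The flipped p-value is only a descriptive hint for planning
# follow-up tests, not a statistically valid conclusion, because the direction
# was chosen after peeking at the data.
from scipy.stats import norm

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    return (p_b - p_a) / se

# Control: 2,000 visitors, 220 conversions; variant: 2,000 visitors, 180 conversions.
z = two_proportion_z(220, 2000, 180, 2000)
p_planned = norm.sf(z)    # the pre-registered test: H1 is "variant is better"
p_flipped = norm.cdf(z)   # informal check of the opposite tail

print(f"planned one-sided p (better): {p_planned:.3f}")  # ~0.98, nowhere near significant
print(f"flipped one-sided p (worse):  {p_flipped:.3f}")  # ~0.018, the variant likely hurt CVR
```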

Positive test results require statistically valid testing procedures to give us confidence in rolling them out. We can treat negative test results as data like any other. And even without bringing statistical significance to bear, observed data is probably a far sight more accurate than certain types of qualitative research that we use to inform tests.

The efficiency benefits of one-sided tests are worth it: they are 25% faster to run than two-sided ones.

One of the final reasons advocates of two-sided tests prefer them to one-sided is the claim that they are more ‘stringent’. However, it is this over-stringency that may be slowing down your program unnecessarily.

It is true that using a 95% two-sided test leads to a lower false-positive rate for successes than a 95% one-sided test. In a two-sided test, you allocate half of your alpha (the probability of a false positive) to each side, so only 2.5% is allotted to the right-hand side of the curve. In a one-sided test, all of the alpha is allotted to one side, so the whole 5% sits on the right-hand side.

Two-sided and one-sided tests. Taken from UCLA.edu.
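A quick way to see that α split in practice is to compare the critical z-values each test asks you to clear. A small sketch, again assuming scipy:

```python
# At a 95% confidence level, a two-sided test puts 2.5% in each tail while a
# one-sided test puts the full 5% in one tail, which lowers the critical
# z-value a "win" has to clear.
from scipy.stats import norm

alpha = 0.05
z_two_sided = norm.ppf(1 - alpha / 2)   # ~1.96: bar for a win under a two-sided test
z_one_sided = norm.ppf(1 - alpha)       # ~1.645: bar for a win under a one-sided test

print(f"two-sided critical z: {z_two_sided:.3f}")
print(f"one-sided critical z: {z_one_sided:.3f}")
```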

So yes, you decrease your false-positive rate, but why do you need a 97.5% significance threshold? That high a bar is unnecessarily stringent, given the risk appetite of most startups and the limited downsides of most online tests. In another post, I’ve argued that for ostensibly positive UX changes, we should use 90% significance tests.

Setting the bar so high leads to a much slower rate of testing that can be detrimental to the impact of your experimentation program. Using a one-sided test instead of a two-sided test for a given significance level and MDE will typically reduce your required sample size by 20%-30%.
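As a rough, back-of-the-envelope sketch of that saving (assuming a simple two-proportion test, a 10% baseline CVR, a 10% relative MDE, 80% power, and scipy again), the required sample size per arm drops by roughly a fifth at 95% confidence; the exact figure shifts within that 20%-30% range as you change the confidence level and power.

```python
# Back-of-the-envelope sample-size comparison for a two-proportion test.
# The baseline CVR, MDE and power below are illustrative; plug in your own.
from scipy.stats import norm

def n_per_arm(p1, p2, alpha, power, two_sided):
    z_alpha = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * var / (p2 - p1) ** 2

p1, p2 = 0.10, 0.11   # baseline CVR and the CVR at the MDE (10% relative lift)
n_two = n_per_arm(p1, p2, alpha=0.05, power=0.80, two_sided=True)
n_one = n_per_arm(p1, p2, alpha=0.05, power=0.80, two_sided=False)

print(f"two-sided: ~{n_two:,.0f} visitors per arm")
print(f"one-sided: ~{n_one:,.0f} visitors per arm")
print(f"reduction: {1 - n_one / n_two:.0%}")   # ~21% fewer visitors needed
```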

To put it another way, if each two-sided test needs four weeks of traffic to detect a given MDE, you will be able to run 13 tests in a year. Switch to one-sided tests, which need roughly 25% less data (about three weeks each), and you can run 17! Those extra four tests will be gold dust for creating both value and learnings.

Moving to a brave new one-sided world

One-sided tests are significantly more efficient to run than two-sided tests. And since we are in the business of producing positive changes to customer experience, negative results are either not that useful, or do not need to be significant to inform future testing. Both factors render the bidirectional power of two-sided tests useless, given the goals of online experimentation programs.

There may be some who continue to disagree, but it’s worth reigniting the discussion. The benefits of switching to one-sided tests are so clear that we have to rigorously question the assumptions of those who do not use them in nearly all instances, particularly when our context differs so vastly from the world of academia that originally created the statistical tools we use today.

Wishing you an even speedier testing program in the coming year!

I hope you enjoyed the post. If you want to get updated when a new post comes out, sign up to my mailing list. Takes less than 5 seconds!

I’d also like to give a special shoutout to Georgi Georgiev. His amazingly well-researched and compelling articles at one-sided.org helped me to understand the importance of one-sided testing. And his detailed comments on early drafts have made my blog posts 2x better. To the others who read drafts of these posts, thank you so much too!
