Hypotheses

A fundamental concept in A/B testing is statistical hypothesis testing. It involves creating a hypothesis about the relationship between two data sets and then comparing those data sets to determine whether there is a statistically significant difference. It may sound complicated, but it is the science behind how A/B testing works.

In this article, we’ll take a high-level look at how statistical hypothesis testing works, so you understand the science behind A/B testing. If you prefer to get straight to testing, we recommend our turnkey conversion optimization services.

Null and Alternative Hypotheses

In A/B testing, you typically start with two types of hypotheses:

First, a Null Hypothesis (H0). This hypothesis assumes that there is no significant difference between the two variations. For example, “Page Variation A and Page Variation B have the same conversion rate.”

Second, an Alternative Hypothesis (H1). This hypothesis assumes there is a significant difference between the two variations. For example, “Page Variation B has a higher conversion rate than Page Variation A.”

Additional reading: How to Create Testing Hypotheses that Drive Real Profits

Disproving the Hypothesis

The primary goal of A/B testing is not to prove the alternative hypothesis but to gather enough evidence to reject the null hypothesis. Here’s how it works in practical terms:

Step 1: We formulate a hypothesis predicting that one version (e.g., Page Variation B) will perform better than another (e.g., Page Variation A).

Step 2: We collect data. By randomly assigning visitors to either the control (original page) or the treatment (modified page), we can collect data on their interactions with the website.

Step 3: We analyze the results, comparing the performance of both versions to see if there is a statistically significant difference.

Step 4: If the data shows a significant difference, you can reject the null hypothesis and conclude that the alternative hypothesis is likely true. If there is no significant difference in conversion rates, you fail to reject the null hypothesis and treat the test as inconclusive rather than accepting the alternative.

Example

To illustrate this process, consider an example where you want to test whether changing the call-to-action (CTA) button from “Purchase” to “Buy Now” will increase the conversion rate.

  • Null Hypothesis: The conversion rates for “Purchase” and “Buy Now” are the same.
  • Alternative Hypothesis: The “Buy Now” CTA button will have a higher conversion rate than the “Purchase” button.
  • Test and Analyze: Run the A/B test, collecting data on the conversion rates for both versions.
  • Conclusion: If the data shows a statistically significant increase in conversions for the “Buy Now” button, you can reject the null hypothesis and conclude that the “Buy Now” button is more effective. (A calculation sketch follows this list.)
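To make the mechanics concrete, here is a minimal sketch of the two-proportion z-test that underlies this kind of decision. The TypeScript functions, visitor counts, and conversion numbers below are hypothetical illustrations, not output from a real test or from any specific testing tool.

```typescript
// Two-proportion z-test on hypothetical conversion data.
function zScore(convA: number, visitorsA: number, convB: number, visitorsB: number): number {
  const pA = convA / visitorsA;                 // conversion rate of the control ("Purchase")
  const pB = convB / visitorsB;                 // conversion rate of the treatment ("Buy Now")
  const pPooled = (convA + convB) / (visitorsA + visitorsB); // pooled rate under the null hypothesis
  const stdError = Math.sqrt(pPooled * (1 - pPooled) * (1 / visitorsA + 1 / visitorsB));
  return (pB - pA) / stdError;                  // how many standard errors apart the two rates are
}

// Standard normal CDF via the Abramowitz & Stegun polynomial approximation.
function normalCdf(z: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const density = 0.3989423 * Math.exp((-z * z) / 2);
  const poly = t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  const upperTail = density * poly;
  return z > 0 ? 1 - upperTail : upperTail;
}

// Hypothetical test: 10,000 visitors per variation.
const z = zScore(400, 10000, 460, 10000);
const pValue = 1 - normalCdf(z);                // one-sided: is "Buy Now" better?
console.log(`z = ${z.toFixed(2)}, one-sided p = ${pValue.toFixed(4)}`);
// A p-value below 0.05 lets us reject the null hypothesis at the 95% confidence level.
```

If the p-value falls below 0.05, the observed lift clears the 95% confidence bar discussed later in this article; otherwise the result is treated as inconclusive.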

Importance of Statistical Significance in A/B Testing

Statistical significance tells you whether the results of a test are real or just random.

When you run an A/B test, for example, and Version B gets more conversions than Version A, statistical significance tells you whether that difference is big enough (and consistent enough) that it likely didn’t happen by chance.

It’s the difference between saying:

“This headline seems to work better…”

vs.

“We’re 95% confident this headline works better—and it’s worth making the change.”

In simple terms:


✅ If your test reaches statistical significance, you can trust the results.
❌ If it doesn’t, the outcome might just be noise—and not worth acting on yet.

We achieve statistical significance by ensuring our sample size is large enough to account for chance and randomness in the results.

Imagine flipping a coin 50 times. While probability suggests you’ll get 25 heads and 25 tails, the actual outcome might skew because of random variation. In A/B testing, the same principle applies. One version might accidentally get more primed buyers, or a subset of visitors might have a bias against an image.

To reduce the impact of these chance variables, you need a large enough sample. Once your results reach statistical significance, you can trust that what you’re seeing is a real pattern—not just noise.

That’s why it’s crucial not to conclude an A/B test until you have reached statistically significant results. You can use tools to check whether your sample sizes are sufficient.

While it appears that one version is doing better than the other, the results overlap too much.

Additional Reading: Four Things You Can Do With an Inconclusive A/B Test

How Much Traffic Do You Need to Reach Statistical Significance?

The amount of traffic you need depends on several factors, but most A/B tests require at least 1,000–2,000 conversions per variation to reach reliable statistical significance. That could mean tens of thousands of visitors, especially if your conversion rate is low.

Here’s what affects your sample size requirement:

  • Baseline conversion rate – The lower it is, the more traffic you’ll need.
  • Minimum detectable effect (MDE) – The smaller the lift you want to detect (e.g., a 2% increase), the more traffic is needed.
  • Confidence level – Most tests aim for 95% statistical confidence.
  • Statistical power – A standard power level is 80%, which ensures a low chance of false negatives.

Rule of thumb: If your site doesn’t get at least 1,000 conversions per month, you may struggle to run statistically sound tests—unless you’re testing big changes that could yield large effect sizes.
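For a rough, back-of-the-envelope check, the standard two-proportion sample-size formula ties these four inputs together. This is a generic statistical sketch, not Conversion Sciences’ proprietary methodology, and the baseline rate and MDE below are hypothetical.

```typescript
// Rough per-variation sample-size estimate using the standard two-proportion formula.
// Plug in your own baseline conversion rate and minimum detectable effect (MDE).
function sampleSizePerVariation(
  baselineRate: number,   // e.g. 0.03 = 3% baseline conversion rate
  mdeRelative: number,    // e.g. 0.10 = detect a 10% relative lift
  zAlpha = 1.96,          // 95% confidence (two-sided)
  zBeta = 0.84            // 80% statistical power
): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + mdeRelative);
  const pBar = (p1 + p2) / 2;
  const numerator = Math.pow(
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)),
    2
  );
  return Math.ceil(numerator / Math.pow(p2 - p1, 2));
}

// A 3% baseline with a 10% relative MDE requires roughly 50,000+ visitors per variation.
console.log(sampleSizePerVariation(0.03, 0.10));
```

Lowering the baseline rate or shrinking the MDE pushes the required sample size up quickly, which is why low-traffic sites struggle to detect small lifts.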

How A/B Testing Tools Work

The tools that make A/B testing possible provide an incredible amount of power. If we wanted to, we could use them to make your website different for every single visitor. We can do this because these tools change your site in the visitor’s browser.

When these tools are installed on your website, they send a snippet of code, called JavaScript, along with the HTML that defines a page. As the page is rendered, this JavaScript changes it. It can do almost anything (a short sketch of such a snippet follows the list below):

  • Change the headlines and text on the page.
  • Hide images or copy.
  • Move elements above the fold.
  • Change the site navigation.
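As an illustration, a treatment snippet of this kind might look something like the sketch below. The selectors, element names, and copy are hypothetical examples, not code from any particular testing tool or client site.

```typescript
// Illustrative browser-side treatment snippet.
// The selectors and copy are hypothetical, invented for this example.
function applyTreatment(): void {
  // Change the headline text.
  const headline = document.querySelector<HTMLHeadingElement>('h1.hero-headline');
  if (headline) headline.textContent = 'Get Your Free Conversion Audit';

  // Swap the call-to-action label.
  const cta = document.querySelector<HTMLButtonElement>('#cta-button');
  if (cta) cta.textContent = 'Buy Now';

  // Hide a distracting sidebar image.
  const promo = document.querySelector<HTMLElement>('.sidebar-promo');
  if (promo) promo.style.display = 'none';
}

// Apply the changes once the page's HTML has been parsed.
document.addEventListener('DOMContentLoaded', applyTreatment);
```

Testing tools generally try to apply changes like these as early as possible in the page load so that visitors do not see a flicker of the original content.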

When testing a page, we create an alternative variation of the page with one or more elements changed for testing purposes. In an A/B test, we limit the test to one element so we can easily understand the impact of that change on conversion rates. The testing tool then does the heavy lifting for us, segmenting the traffic and serving the control (or existing page) or the test variation.

It’s also possible to test more than one element at a time—a process called multivariate testing. However, this approach is more complex and requires rigorous planning and analysis. If you’re considering a multivariate test, we recommend letting a Conversion Scientist™ design and run it to ensure valid, reliable results.

Primary Functions of A/B Testing Tools

A/B testing software has the following primary functions.

Serve Different Webpages to Visitors

The first job of A/B testing tools is to show different webpages to certain visitors. The person who designed your test will determine what gets shown.

An A/B test will have a “control,” or the current page, and at least one “treatment,” or the page with some change. The design and development team will work together to create a different treatment. JavaScript must be written to transform the control into the treatment.

It is important that the JavaScript works on all devices and in all browsers used by the visitors to a site. This requires a committed QA effort. At Conversion Sciences, we maintain a library of devices of varying ages that allows us to test our JavaScript for all visitors.

Split Traffic Evenly

Once we have JavaScript to display one or more treatments, our A/B testing software must determine which visitors see the control and which see the treatments.

Typically, visitors are rotated evenly across variations: one sees the control, the next sees treatment A, the next sees treatment B, and then the rotation returns to the control. This continues until enough visitors have been tested to achieve statistical significance.

It is important that each version is seen by roughly the same number of visitors, and the software enforces this balance.
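One common way to keep the split even and consistent for returning visitors is to hash a visitor ID into a bucket. The sketch below is an assumption about how this could work, not a description of any specific tool’s internals.

```typescript
// Sketch of sticky, evenly weighted traffic assignment.
// Hashing a visitor ID is one common approach; this is an illustrative assumption.
const VARIATIONS = ['control', 'treatmentA', 'treatmentB'] as const;
type Variation = typeof VARIATIONS[number];

function assignVariation(visitorId: string): Variation {
  // Simple string hash; the same visitor always lands in the same bucket.
  let hash = 0;
  for (const ch of visitorId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return VARIATIONS[hash % VARIATIONS.length];
}

// Each visitor maps deterministically to one variation, and a large audience
// spreads roughly evenly across the buckets.
console.log(assignVariation('visitor-12345'));
```

Because the assignment is deterministic, a returning visitor keeps seeing the same variation, which keeps the measurement clean.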

Measure Results

The A/B testing software tracks results by monitoring goals. Goals can be any of a number of measurable things:

  1. Products bought by each visitor and the amount paid
  2. Subscriptions and signups completed by visitors
  3. Forms completed by visitors
  4. Documents downloaded by visitors

While nearly anything can be measured, it’s the business-building metrics—purchases, leads, signups—that matter most.

The software remembers which test page was seen. It calculates the amount of revenue generated by those who saw the control, by those who saw treatment one, and so on.
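In simplified form, that bookkeeping amounts to tying each goal back to the variation the visitor saw and totaling the results. The record shape and numbers below are invented for illustration, not a real tool’s data model.

```typescript
// Simplified sketch of per-variation revenue tracking.
interface ConversionEvent {
  visitorId: string;
  variation: 'control' | 'treatmentA';
  revenue: number; // 0 for non-purchase goals such as form completions
}

function revenueByVariation(events: ConversionEvent[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const e of events) {
    totals[e.variation] = (totals[e.variation] ?? 0) + e.revenue;
  }
  return totals;
}

// Hypothetical events: the tool remembers which page each visitor saw.
const events: ConversionEvent[] = [
  { visitorId: 'a1', variation: 'control', revenue: 49 },
  { visitorId: 'b2', variation: 'treatmentA', revenue: 49 },
  { visitorId: 'c3', variation: 'treatmentA', revenue: 98 },
];
console.log(revenueByVariation(events)); // { control: 49, treatmentA: 147 }
```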

At the end of the test, we can answer one very important question: which page generated the most revenue, subscriptions or leads? If one of the treatments wins, it becomes the new control.

And the process starts over.

Do Statistical Analysis

The tools are always calculating the confidence that a result will predict the future. We don’t trust any test that doesn’t have at least a 95% confidence level. This means that we are 95% confident that a new change will generate more revenue, subscriptions or leads.

Sometimes it’s hard to wait for statistical significance, but it’s important lest we make the wrong decision and start reducing the website’s conversion rate.

Report Results

Finally, the software communicates results to us. These come as graphs and statistics that not only show results but also help you decide what to implement and what to test next.

AB Testing Tools deliver data in the form of graphs and statistics.

It’s easy to see that the treatment won this test, giving us an estimated 90.9% lift in revenue per visitor with 98% confidence.

This is a large win for this client.

Curious about the wins you’d see with a Conversion Scientist™ managing your CRO program? Book a free strategy call today.

Selecting The Right Tools

Of course, there are a lot of A/B testing tools out there, with new versions hitting the market every year. While there are certainly some industry favorites, the tools you select should come down to what your specific business requires.

In order to help make the selection process easier, we reached out to our network of CRO specialists and put together a list of the top-rated tools in the industry. We rely on these tools to perform for multi-million dollar clients and campaigns, and we are confident they will perform for you as well.

Check out the full list of tools here: The 20 Most Recommended A/B Testing Tools By Leading CRO Experts

The law of unintended consequences states that every human endeavor will generate some result that was not, and could not have been, foreseen. The law applies to hypothesis testing as well.

In fact, Brian Cugelman introduced me to an entire spectrum of outcomes that is helpful when evaluating AB testing results. Brian was talking about unleashing chemicals in the brain, and I’m applying his model to AB testing results. See my complete notes on his Conversion XL Live presentation below.

Understanding the AB Testing Results Map

In any test we conduct, we are trying everything we can to drive to a desired outcome. Unfortunately, we don’t always achieve the outcomes we want or intend. For any test, our results fall along two spectrums that define four general quadrants.

Map of possible outcomes from hypotheses.

On one axis we ask, “Was the outcome as we intended, or was there an unintended result?” On the other axis we ask, “Was it a negative or positive outcome?”

While most of our testing seeks to achieve the quadrant defined by positive, intended outcomes, each of these quadrants gives us an opportunity to move our conversion optimization program a step forward.

I. Pop the Champagne, We’ve Got a New Control

With every test, we seek to “beat” the existing control, the page or experience that is currently performing the best of all treatments we’ve tried. When our intended outcome is a positive outcome, everyone is all smiles and dancing. It’s the most fun we have in this job.

In general, we want our test outcomes to fall into this quadrant (quadrant I), but not exclusively. There is much to be learned from the other three quadrants.

II. Testing to Lose

Under what circumstances would we actually run an AB test intending to see a negative outcome? That is the question of Quadrant II. A great example of this is adding a CAPTCHA to a form.

CAPTCHA is an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”. We believe it should be called, “Get Our Prospects to Do Our Spam Management For Us”, or “GOPDOSMFU”. Businesses don’t like to get spam. It clogs their prospect inboxes, wastes the time of sales people and clouds their analytics.

However, we don’t believe that the answer is to make our potential customers take an IQ test before submitting a form.

These tools inevitably reduce form completion rates, and not just for spam bots.

CAPTCHAs reduce spam, but at what cost?

So, if a business wants to add a CAPTCHA to a form, we recommend understanding the hidden costs of doing so. We’ll design a test with and without the CAPTCHA, fully expecting a negative outcome. The goal is to understand how big the negative impact is. Usually, it’s too big.

In other situations, a design feature that is brand oriented may be proposed. Often a design decision that enhances the company brand will have a negative impact on conversion and revenue. Nonetheless, we will test to see how big the negative impact is. If it’s reasonable, then the loss of revenue is seen as a marketing expense. In other words, we expect the loss of short-term revenue to be offset by the long-term revenue from a stronger brand message.

These tests are like insurance policies. We do them to understand the cost of decisions that fall outside of our narrow focus on website results. The question is not, “Is the outcome negative?” The question is, “How negative is the outcome?”

III. Losers Rule Statistically

Linus Pauling once said, “You can’t have good ideas without having lots of ideas.” What is implied in this statement is that most ideas are crap. Just because we call them test hypotheses doesn’t mean that they are any more valuable than rolls of the dice.

When we start a conversion optimization process, we generate a lot of ideas. Since we’re brilliant, experienced, and wear lab coats, we brag that only half of these ideas will be losers for any client. Fully half won’t increase the performance of the site, and many will make things worse.

Most of these fall into the quadrant of unintended negative outcomes. The control has won. Record what we learned and move on.

There is a lot to be learned from failed tests. Note that we call them “inconclusive” tests as this sounds better than “failed”.

If the losing treatment reduced conversion and revenue, then you learn something about what your visitors don’t like.

Just like a successful test, you must ask the question, “Why?”.

Why didn’t they like our new background video? Was it offensive? Did it load too slowly? Did it distract them from our message?

Take a moment and ask, “Why,” even when your control wins.

IV. That Wasn’t Expected, But We’ll Take the Credit

Automatic.com was seeking a very specific outcome when we designed a new home page for them: more sales of the adapter and app that connects a smartphone to the car’s electronic brain. The redesign did achieve that goal. However, there was another unintended result. There was an increase in the number of people buying multiple adapters rather than just one.

We weren’t testing to increase average order value in this case. It happened nonetheless. We might have missed it had we not instinctively calculated average order value when looking at the data. Other unintended consequences may be harder to find.

This outcome usually spawns new hypotheses. What was it about our new home page design that made more buyers decide to get an adapter for all of their cars? Did we discover a new segment, the segment of visitors that have more than one car?

These questions all beg for more research and quite possibly more testing.

When Outcomes are Mixed

There is rarely one answer for any test we perform. Because we have to create statistically valid sample sizes, we throw together some very different groups of visitors. For example, we regularly see a difference in conversion rates between visitors using the Safari browser and those using Firefox. On mobile, we see different results when we look only at visitors coming on an Android than when we look at those using Apple’s iOS.

3.9% more Android users converted with this design, while 21% fewer iPhone users converted.

Android users liked this test but iPhone users really did not.

In short, you need to spend some time looking at your test results to ensure that you don’t have offsetting outcomes.
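A quick way to check for offsetting outcomes is to compute the lift segment by segment. The sketch below uses invented numbers chosen to roughly mirror the Android and iPhone split described above; it is not real client data.

```typescript
// Hypothetical per-segment lift check; numbers are illustrative only.
interface SegmentData {
  segment: string; // e.g. 'Android' or 'iOS'
  control: { visitors: number; conversions: number };
  treatment: { visitors: number; conversions: number };
}

function liftBySegment(data: SegmentData[]): void {
  for (const d of data) {
    const crControl = d.control.conversions / d.control.visitors;
    const crTreatment = d.treatment.conversions / d.treatment.visitors;
    const lift = ((crTreatment - crControl) / crControl) * 100;
    console.log(`${d.segment}: ${lift.toFixed(1)}% lift`);
  }
}

// A positive overall result can hide a losing segment.
liftBySegment([
  { segment: 'Android', control: { visitors: 4000, conversions: 200 }, treatment: { visitors: 4000, conversions: 208 } },
  { segment: 'iOS',     control: { visitors: 4000, conversions: 200 }, treatment: { visitors: 4000, conversions: 158 } },
]);
// Android: +4.0% lift; iOS: -21.0% lift, roughly mirroring the mixed outcome described above.
```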

The Motivational Chemistry and the Science of Persuasion

Here are my notes from Brian Cugelman’s presentation that inspired this approach to AB testing results. He deals a lot with the science of persuasion.

My favorite conclusions are:

“You will get more mileage from ANTICIPATION than from actual rewards.”

“Flattery will get you everywhere.”

I hope this infographic generates some dopamine for you, and that your newfound intelligence will produce serotonin during your next social engagement.

Motivational Chemistry infodoodle by Brian Cugelman at ConversionXL Live.

Today’s question is at the heart of AB testing: “How do you decide what elements of a site to test?” We call these test ideas “hypotheses.”
But a better question is, “How do you determine what NOT to test?”

It’s relatively easy to come up with ideas that might increase your conversion rate. We typically come up with fifty, seventy-five, one-hundred or more ideas for each of our client sites. Filtering through this list is the hard part.


The Five Steps

In this week’s podcast, I take you through the five steps we use to determine what to test on a website.

  • Step One: Look for Evidence
  • Step Two: Rate the Traffic
  • Step Three: How Hard is it to Test?
  • Step Four: What does experience tell you?
  • Step Five: Bucket the Winners

We’re pretty good at picking low-hanging fruit. Last year 97% of our clients continued working with us after our initial six-month Conversion Catalyst program that uses this approach.

Each of our hypotheses gets an ROI score using the following formula:

ROI = Evidence + Traffic Value + History – Level of Effort
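Applied literally, that scoring might look like the sketch below. The hypotheses, score ranges, and values are invented for illustration; they are not the actual Conversion Catalyst scoring sheet.

```typescript
// Illustrative scoring of hypotheses with the ROI formula above.
// The hypotheses and scores are invented examples, not real client data.
interface Hypothesis {
  name: string;
  evidence: number;      // strength of supporting data, e.g. 1-5
  trafficValue: number;  // value of the traffic the page receives, e.g. 1-5
  history: number;       // what past experience tells us, e.g. 1-5
  levelOfEffort: number; // how hard it is to build and test, e.g. 1-5
}

const roiScore = (h: Hypothesis): number =>
  h.evidence + h.trafficValue + h.history - h.levelOfEffort;

const hypotheses: Hypothesis[] = [
  { name: 'Add reviews to product pages', evidence: 4, trafficValue: 5, history: 3, levelOfEffort: 2 },
  { name: 'Redesign checkout flow',       evidence: 5, trafficValue: 5, history: 4, levelOfEffort: 5 },
  { name: 'New hero headline',            evidence: 2, trafficValue: 4, history: 2, levelOfEffort: 1 },
];

// Rank highest ROI first, then carry the top ideas forward for bucketing.
const ranked = [...hypotheses].sort((a, b) => roiScore(b) - roiScore(a));
ranked.forEach(h => console.log(`${roiScore(h)}  ${h.name}`));
```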

Once we’ve ranked all of our hypotheses, we classify them into buckets.

The top ten hypotheses reveal an interesting pattern when you bucket them.

Bucketing Your Hypotheses

I also talk about how we classify hypotheses into buckets.

  1. User Experience: For hypotheses that would alter the layout, design, or other user interface and user experience issues.
  2. Credibility and Authority: For hypotheses that address trust and credibility issues of the business and the site.
  3. Social Proof: For hypotheses that build trust by showing others’ experiences.
  4. Value Proposition: For hypotheses that address the overall messaging and value proposition. Quality, availability, pricing, shipping, business experience, etc.
  5. Risk Reversal: For hypotheses that involve warranties, guarantees, and other assurances of safety.

This helps us understand what the primary areas of concern are for visitors to a site. Are there a lot of high-ranked hypotheses for Credibility and Authority? We need to focus on building trust with visitors.

There’s much more detail in the podcast and my Marketing Land column 5 Steps to Finding the Hidden Optimization Gems.
