The Ultimate A/B Testing Guide: Everything You Need, All In One Place

Welcome to the ultimate A/B testing guide from the original and most experienced Conversion Optimization Agency on the planet!

In this post, we’re going to cover everything you need to know about A/B testing (also referred to as “split” testing), from start to finish. Here’s what we’ll cover:

Table of Contents

By the end of this guide, you’ll have a thorough understanding of the entire AB testing process and a framework for diving deeper into any topic you wish to further explore.

In addition to this guide, we’ve put together an intuitive 9-part course taking you through the fundamentals of conversion rate optimization. Complete the course, and we’ll review your website for free!

No time to learn it all on your own? Check out our turn-key Conversion Rate Optimization Services and book a consultation to see how we can help you.

1. The Basic Components Of A/B Testing

AB testing, also referred to as “split” or “A/B/n” testing, is the process of testing multiple variations of a web page in order to identify higher-performing variations and improve the page’s conversion rate.


Over the last few years, AB testing has become “kind of a big deal”.

Online marketing tools have become more sophisticated and less expensive, making split testing a more accessible pursuit for small and mid-sized businesses. And with traffic becoming more expensive, the rate at which online businesses are able to convert incoming visitors is becoming more and more important.

The basic A/B testing process looks like this:

  1. Make a hypothesis about one or two changes you think will improve the page’s conversion rate.
  2. Create a variation or variations of that page with one change per variation.
  3. Divide incoming traffic equally between each variation and the original page.
  4. Run the test as long as it takes to acquire statistically significant findings.
  5. If a page variation produces a statistically significant increase in page conversions, use it to replace the original page.
  6. Repeat
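The six steps above can be sketched as a small simulation. The conversion probabilities here are invented stand-ins for real visitor behavior, purely for illustration:

```python
import random

def run_ab_test(variations, visitors, conversion_probs):
    """Split simulated visitors evenly across variations and tally conversions.
    conversion_probs are made-up stand-ins for real visitor behavior."""
    counts = {v: {"visitors": 0, "conversions": 0} for v in variations}
    for i in range(visitors):
        v = variations[i % len(variations)]  # round-robin keeps the groups equal
        counts[v]["visitors"] += 1
        if random.random() < conversion_probs[v]:
            counts[v]["conversions"] += 1
    return counts

results = run_ab_test(["control", "variation"], 10_000,
                      {"control": 0.10, "variation": 0.12})
```

In a real test the split is handled by testing software and the “conversion” is a tracked goal, but the bookkeeping is essentially this.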

Have you ever heard the story of someone changing their button color from red to green and received a $5 million increase in sales that year?

As cool as that sounds, let’s be honest: it is not likely that either you or I will see this kind of a win anytime soon. That said, one button tweak did result in $300 million in new revenue for one business, so it is possible.

AB testing is a scientific way of finding out whether a tweak that leads to a boost in conversions is actually significant, or just a random fluctuation.

AB testing (AKA “split testing”) is the process of directing your traffic to two or more variations of a web page.

AB testing is pretty simple to understand:

A typical AB test uses AB testing software to divide traffic.

Our testing software is the “Moses” that splits our traffic for us. Additionally, you can choose to experiment with more than two versions at once. These tests are called A/B/n tests, where “n” represents any number of new variations.

The goal of AB testing is to measure whether a variation results in more conversions.

So that could be an “A/B/C” test, an “A/B/C/D” test, and so on.

Here’s what an A/B/C test would look like:

The more variations we have in an AB test, the more we have to divide the traffic.

Even though the same traffic is sent to the Control and each Variation, a different number of visitors will typically complete their task — buy, signup, subscribe, etc. This is because many leave your site first.

We research our visitors to find out what might be making them leave before converting. These are our test hypotheses.

The primary point of an AB test is to discover what issues cause visitors to leave. The issues above are common to ecommerce websites. In this case we might create additional variations:

  1. One that adds a return policy to the page.
  2. One that removes the registration requirement.
  3. One that adds trust symbols to the site.

By split testing these changes, we see if we can get more of these visitors to finish their purchase, to convert.

How do we know which issues might be causing visitors to leave? We research our visitors, look at analytics data, and make educated guesses, which we at Conversion Sciences call “hypotheses”.

In this example, adding a return policy performed best. Removing the registration requirement performed worse than the Control.

In the image above, the number of visitors that complete a transaction is shown. Based on this data, we would learn that adding a return policy and trust symbols would increase success over the Control or removing registration.

The page that added the return policy is our new Control. Our next test would very likely be to see what happens when we add trust symbols to this new Control. It is not unlikely that combining the two could actually reduce the conversion rate. So we test it.

Likewise, it is possible that removing the registration requirement would work well on the page with the return policy, our new Control. However, we may not test this combination.

With an AB test, we try each change in its own variation to isolate the specific issues and decide which combinations to test based on what we learn.

The goal of AB testing is to identify and verify changes that will increase a page’s overall conversion rate, whether those changes are minor or more involved.

I’m fond of saying that AB testing, or split testing, is the “Supreme Court” of data collection. An AB test gives us the most reliable information about a change to our site. It controls for a number of variables that can taint our data.

2. The Proven AB Testing Framework

Now that we have a feel for the tests themselves, we need to understand how these tests fit into the grand scheme of things.

There’s a reason we are able to get consistent results for our clients here at Conversion Sciences. It’s because we have a proven framework in place: a system that allows us to approach any website and methodically derive revenue-boosting insights.

Different businesses and agencies will have their own unique processes within this system, but any CRO agency worth its name will follow some variation of the following framework when conducting A/B testing.

AB Testing Framework Infographic


For a closer look at each of these nine steps, check out our in-depth breakdown here: The Proven AB Testing Framework Used By CRO Professionals

3. The Critical Statistics Behind Split Testing

You don’t need to be a mathematician to run effective AB tests, but you do need a solid understanding of the statistics behind split testing.

An AB test is an example of statistical hypothesis testing, a process whereby a hypothesis is made about the relationship between two data sets and those data sets are then compared against each other to determine if there is a statistically significant relationship or not.

To put this in more practical terms, a prediction is made that Page Variation #B will perform better than Page Variation #A, and then data sets from both pages are observed and compared to determine if Page Variation #B is a statistically significant improvement over Page Variation #A.

That seems fairly straightforward, so where does it get complicated?

The complexities arise in all the ways a given “sample” can inaccurately represent the overall “population”, and in all the things we have to do to ensure that our sample accurately represents the population.

Let’s define some terminology real quick.

Population and variance: two samples drawn from the same population can differ in makeup, and that difference is the variance.

The “population” is the group we want information about. It’s the next 100,000 visitors in my previous example. When we’re testing a webpage, the true population is every future individual who will visit that page.

The “sample” is a small portion of the larger population. It’s the first 1,000 visitors we observe in my previous example.

In a perfect world, the sample would be 100% representative of the overall population.

For example:

Let’s say 10,000 out of those 100,000 visitors are going to ultimately convert into sales. Our true conversion rate would then be 10%.

In a tester’s perfect world, the mean (average) conversion rate of any sample(s) we select from the population would always be identical to the population’s true conversion rate. In other words, if you selected a sample of 10 visitors, 1 of them (10%) would buy, and if you selected a sample of 100 visitors, then 10 of them would buy.

But that’s not how things work in real life.

In real life, you might have only 2 out of the first 100 buy or you might have 20… or even zero. You could have a single purchase from Monday through Friday and then 30 on Saturday.

This variability across samples is expressed as a unit called the “variance”, which measures how far a random sample can differ from the true mean (average).
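A quick simulation makes this concrete. Assuming the true conversion rate of 10% from the example above, small samples scatter widely around that rate while large samples settle close to it (the numbers themselves are illustrative):

```python
import random

random.seed(7)
TRUE_RATE = 0.10  # the population's true conversion rate from the example

def sample_rate(n):
    """Observed conversion rate of one random sample of n visitors."""
    return sum(random.random() < TRUE_RATE for _ in range(n)) / n

small_samples = [sample_rate(100) for _ in range(5)]     # swing widely around 10%
large_samples = [sample_rate(10_000) for _ in range(5)]  # hug the true 10%
```

Run it a few times with different seeds and you will see the small samples bounce around far more than the large ones; that spread is the variance at work.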

This variance across samples can derail our findings, which is why we have to employ statistically sound hypothesis testing in order to get accurate results.


How AB Testing Eliminates Timing Issues

One alternative to AB testing is “serial” testing, or change-something-and-see-what-happens testing. I am a fan of serial testing, and you should make it a point to go and see how changes are affecting your revenue, subscriptions and leads.

There is a problem, however. If you make your change at the same time that a competitor starts an awesome promotion, you may see a drop in your conversion rates. You might blame your change when, in fact, the change in performance was an external market force.

AB testing controls for this.

In an AB test, the first visitor sees the original page, which we call the Control. This is the “A” in the term “AB test”. The next visitor sees a version of the page with the change that’s being tested. We call this a Treatment, or Variation. This is the “B” in the term “AB test”. We can also have a “C” and a “D” if we have enough traffic.

The next visitor sees the Control and the next the Treatment. This goes on until enough people have seen each version to tell us which they like best. We call this statistical significance. Our software tracks these visitors across multiple visits and tells us which version of the page generated the most revenue or leads.

Since visitors come over the same time period, changes in the marketplace — like our competitor’s promotion — won’t affect our results. Both pages are served during the promotion, so there is no before-and-after error in the data.

Another way variance can express itself is in the way different types of traffic behave differently. Fortunately, you can eliminate this type of variance simply by segmenting traffic.

How Visitor Segmentation Controls For Variability

An AB test gathers data from real visitors and customers who are “voting” on our changes using their dollars, their contact information and their commitment to our offerings. If done correctly, the makeup of visitors should be the same for the control and each treatment.

This is important. Visitors that come to the site from an email may be more likely to convert to a customer. Visitors coming from organic search, however, may be early in their research, with not as many ready to buy.

If you sent email traffic to your control and search traffic to the treatment, it may appear that the control is a better implementation. In truth, it was the kind of traffic or traffic segment that resulted in the different performance.

By segmenting types of traffic and testing them separately, you can easily control for this variation and get a much better understanding of visitor behavior.

Why Statistical Significance Is Important

One of the most important concepts to understand when discussing AB testing is statistical significance, which is ultimately all about using large enough sample sizes when testing. There are many places where you can acquire a more technical understanding of this concept, so I’m going to attempt to illustrate it instead in layman’s terms.

Imagine flipping a coin 50 times. While from a probability perspective, we know there is a 50% chance of any given flip landing on heads, that doesn’t mean we will get 25 heads and 25 tails after 50 flips. In reality, we will probably see something like 23 heads and 27 tails or 28 heads and 22 tails.

Our results won’t match the probability because there is an element of chance to any test – an element of randomness that must be accounted for. As we flip more times, we decrease the effect this chance will have on our end results. The point at which we have decreased this element of chance to a satisfactory level is our point of statistical significance.
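You can watch that element of chance shrink in a quick simulation (a sketch only; the exact numbers will differ from run to run):

```python
import random

random.seed(3)

def heads_fraction(flips):
    """Fraction of simulated fair-coin flips that land heads."""
    return sum(random.random() < 0.5 for _ in range(flips)) / flips

few = heads_fraction(50)       # often noticeably off from 50%
many = heads_fraction(50_000)  # chance has largely averaged out
```

With 50 flips, results like 46% or 56% heads are routine; with 50,000 flips the result reliably lands within a fraction of a percent of 50%.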

In the same way, when running an AB test on a web page, there is an element of chance involved. One variation might happen to receive more primed buyers than the other, or perhaps an isolated group of visitors happens to have a negative association with an image used on one page. These chance factors will skew your results if your sample size isn’t large enough.

While it appears that one version is doing better than the other, the results overlap too much.

It’s important not to conclude an AB test until you have reached statistically significant results. Here’s a handy tool to check if your sample sizes are large enough.
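If you want a feel for what those calculators compute, here is a back-of-the-envelope version of the standard two-proportion sample-size formula. It is a sketch only; the default significance level and power are common conventions, not universal settings, and real calculators may use more refined methods:

```python
from math import sqrt
from statistics import NormalDist

def sample_size_per_variation(p1, p2, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect a change in
    conversion rate from p1 to p2 with a two-sided z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_b = NormalDist().inv_cdf(power)          # critical value for power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar)) +
                 z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

# Detecting a lift from 10% to 12% takes thousands of visitors per variation.
n = sample_size_per_variation(0.10, 0.12)
```

Notice how the required sample size explodes as the expected lift shrinks: halving the detectable difference roughly quadruples the traffic you need.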

For a closer look at the statistics behind A/B testing, check out this in-depth post: AB Testing Statistics: An Intuitive Guide For Non-Mathematicians

4. How To Conduct Pre-Test Research

The definition of optimization boils down to understanding your visitors.

In order to succeed at A/B testing, we need to be creating variations that perform better for our visitors. In order to create those types of variations, we need to understand what visitors aren’t liking about our existing site and what they want instead.

In other words: we need research.

Conversion Research Evidence with Klientboost Infographic

For a close look at each of these sections, check out our full writeup here: AB Testing Research: Do Your Conversion Homework

5. How To Create An A/B Testing Strategy

Once we’ve done our homework and identified both problem areas and opportunities for improvement on our site, it’s time to develop a core testing strategy.

An A/B testing strategy is essentially a lens through which we will approach test creation. It helps us prioritize and focus our efforts in the most productive direction possible.

There are 7 primary testing strategies that we use here at Conversion Sciences.

  1. Gum Trampoline
  2. Completion Optimization
  3. Flow Optimization
  4. Minesweeper
  5. Big Rocks
  6. Big Swings
  7. Nuclear Option

Since there is little point in summarizing these, click here to read our breakdown of each strategy: The 7 Core Testing Strategies Essential To Optimization

6. “AB” & “Split” Testing Versus “Multivariate” Testing

While most marketers tend to use these terms interchangeably, there are a few differences to be aware of. While AB testing and split testing are the exact same thing, multivariate testing is slightly different.

AB and Split tests refer to tests that measure larger changes on a given page. For example, a company with a long-form landing page might AB test the page against a new short version to see how visitors respond. In another example, a business seeking to find the optimal squeeze page might design two pages around different lead magnets and compare them to see which converts best.

Multivariate testing, on the other hand, focuses on optimizing small, important elements of a webpage, like CTA copy, image placement, or button colors. Often, a multivariate test will test more than two options at a time to quickly identify outlying winners. For example, a company might run a multivariate test cycling 6 different button colors on its most important sales page. With high enough traffic, even a 0.5% increase in conversions can result in a significant revenue boost.

Multivariate testing works through all possible combinations.

While most websites can run meaningful split tests, multivariate tests are typically reserved for bigger sites, as they require a large amount of traffic to produce statistically significant results.

For a more in-depth look at multivariate testing, click here: Multivariate Testing: Promises and Pitfalls for High-Traffic Websites

7. How To Analyze Testing Results

After we’ve run our tests, it’s time to collect and analyze the results. My co-founder Joel Harvey explains how Conversion Sciences approaches post-test analysis below:

When you look at the results of an AB testing round, the first thing you need to look at is whether the test was a loser, a winner, or inconclusive.

Verify that the winners were indeed winners. Look at all the core criteria: statistical significance, p-value, test length, delta size, etc. If it checks out, then the next step is to show it to 100% of traffic and look for that real-world conversion lift.

In a perfect world you could just roll it out for 2 weeks and wait, but usually, you are jumping right into creating new hypotheses and running new tests, so you have to find a balance.

Once we’ve identified the winners, it’s important to dive into segments.

  • Mobile versus non-mobile
  • Paid versus unpaid
  • Different browsers and devices
  • Different traffic channels
  • New versus returning visitors (important to set up and integrate this beforehand)

This is fairly easy to do with enterprise tools, but might require some more effort with less robust testing tools. It’s important to have a deep understanding of how tested pages performed with each segment. What’s the bounce rate? What’s the exit rate? Did we fundamentally change the way this segment is flowing through the funnel?

We want to look at this data in full, but it’s also good to remove outliers falling outside two standard deviations of the mean and re-evaluate the data.

It’s also important to pay attention to lead quality. The longer the lead cycle, the more difficult this is. In a perfect world, you can integrate the CRM, but in reality, this often doesn’t work very seamlessly.

For a more in-depth look at post test analysis, including insights from the CRO industry’s foremost experts, click here: 10 CRO Experts Explain How To Profitably Analyze AB Test Results

8. How AB Testing Tools Work

The tools that make AB testing possible provide an incredible amount of power. If we wanted, we could use these tools to make your website different for every visitor. The reason we can do this is that these tools change your site in the visitors’ browsers.

When these tools are installed on your website, they send some code, called JavaScript, along with the HTML that defines a page. As the page is rendered, this JavaScript changes it. It can do almost anything:

  • Change the headlines and text on the page.
  • Hide images or copy.
  • Move elements above the fold.
  • Change the site navigation.

Primary Functions of AB Testing Tools

AB testing software has the following primary functions.

Serve Different Webpages to Visitors

The first job of AB testing tools is to show different webpages to certain visitors. The person that designed your test will determine what gets shown.

An AB test will have a “control”, or the current page, and at least one “treatment”, or the page with some change. The design and development team will work together to create a different treatment. The JavaScript must be written to transform the control into the treatment.

It is important that the JavaScript work on all devices and in all browsers used by the visitors to a site. This requires a committed QA effort.

Conversion Sciences maintains a library of devices of varying ages that allows us to test our JavaScript for all visitors.

Split Traffic Evenly

Once we have JavaScript to display one or more treatments, our AB testing software must determine which visitors see the control and which see the treatments.

Typically, visitors are rotated through the versions in turn: the first sees the control, the next sees the first treatment, the next the second treatment, and the fourth the control again. Around it goes until enough visitors have been tested to achieve statistical significance.

It is important that the number of visitors seeing each version is about the same size. The software tries to enforce this.
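One common way tools keep the split even and consistent across visits is deterministic hash-based bucketing: the same visitor ID always lands in the same bucket, so a returning visitor sees the same page. A minimal sketch (the visitor ID format is hypothetical):

```python
import hashlib

def assign_variation(visitor_id: str, variations: list) -> str:
    """Hash the visitor ID into a stable bucket so a returning visitor
    always sees the same version of the page."""
    digest = hashlib.sha256(visitor_id.encode()).hexdigest()
    return variations[int(digest, 16) % len(variations)]

page = assign_variation("visitor-123", ["control", "treatment"])
```

Because the hash output is effectively uniform, the buckets stay close to equal in size without the software having to coordinate anything across visitors.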

Measure Results

The AB testing software tracks results by monitoring goals. Goals can be any of a number of measurable things:

  1. Products bought by each visitor and the amount paid
  2. Subscriptions and signups completed by visitors
  3. Forms completed by visitors
  4. Documents downloaded by visitors

Almost anything can be measured, but the most important are business-building metrics such as purchases, subscriptions and leads generated.

The software remembers which test page was seen. It calculates the amount of revenue generated by those who saw the control, by those who saw treatment one, and so on.

At the end of the test, we can answer one very important question: which page generated the most revenue, subscriptions or leads? If one of the treatments wins, it becomes the new control.

And the process starts over.

Do Statistical Analysis

The tools are always calculating the confidence that a result will predict the future. We don’t trust any test that doesn’t have at least a 95% confidence level. This means that we are 95% confident that a new change will generate more revenue, subscriptions or leads.
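At its heart, that confidence calculation is a two-proportion z-test. Here is a simplified sketch of what the tools compute; real testing software may use different or more sophisticated statistical methods:

```python
from math import sqrt
from statistics import NormalDist

def confidence_level(conv_a, n_a, conv_b, n_b):
    """Two-sided confidence that variation B's conversion rate truly
    differs from A's, via a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = abs(p_b - p_a) / se
    return 2 * NormalDist().cdf(z) - 1  # e.g. 0.95 means 95% confidence

# 10% vs. 13% conversion over 1,000 visitors each
c = confidence_level(100, 1000, 130, 1000)
```

A result like 0.96 would clear the 95% bar; identical conversion rates yield a confidence near zero.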

Sometimes it’s hard to wait for statistical significance, but it’s important lest we make the wrong decision and start reducing the website’s conversion rate.

Report Results

Finally, the software communicates results to us. These come as graphs and statistics.

AB Testing Tools deliver data in the form of graphs and statistics.

It’s easy to see that the treatment won this test, giving us an estimated 90.9% lift in revenue per visitor with a 98% confidence.

This is a rather large win for this client.

Selecting The Right Tools

Of course, there are a lot of A/B testing tools out there, with new versions hitting the market every year. While there are certainly some industry favorites, the tools you select should come down to what your specific business requires.

In order to help make the selection process easier, we reached out to our network of CRO specialists and put together a list of the top-rated tools in the industry. We rely on these tools to perform for multi-million dollar clients and campaigns, and we are confident they will perform for you as well.

Check out the full list of tools here: The 20 Most Recommended AB Testing Tools By Leading CRO Experts

9. How To Build An A/B Testing Team

The members of a CRO team graphic

The members of a CRO team.

Conversion Sciences offers a complete turnkey team for testing. Every team that will use these tools must have competent people in the following roles, and we recommend you follow suit in building your own teams.

Data Analyst

The data analyst looks at the data being collected by analytics tools, user experience tools, and information collected by the website owners. From this she begins developing ideas, or hypotheses, for why a site doesn’t have a higher conversion rate.

The data analyst is responsible for designing tests that prove or disprove a hypothesis. Once the test is designed, she hands it off to the designer and developer for implementation.

Designer

The designer is responsible for designing new components for the site. These may range from something as simple as creating a button with a different call to action to completely redesigning a landing page for conversion.

The designer must be experienced enough to carefully design the changes we are testing. We want to change the element we are testing and nothing else.

Developer

Our developers are very good at creating JavaScript that manipulates a page without breaking anything. They are experienced enough to write JavaScript that will run successfully on a variety of devices, operating systems and browsers.

QA Tech

The last thing we want to do is break a commercial website. This can result in lost revenue and invalidate our tests. A good quality assurance person checks the JavaScript and design work to ensure it works on all relevant devices, operating systems and browsers.

Getting Started on AB Testing

Conversion Sciences invites all businesses to work AB testing into their marketing mix. You can start by working with us and then move the effort in-house.

Get started with our 180-day Conversion Catalyst program, a process designed to get you started AND pay for itself with newly discovered revenue.

Brian Massey
  1. Hammad Akbar says:

    I landed on your blog and found that you guys are doing great work. I really loved this A/B testing guide. Keep on generating such great content.

  2. Colton says:

    The portion regarding calculating sample size is incomplete.

    You suggest to navigate to this link to calculate if results are significant:

    This calculator is fine to test if the desired confidence interval is met; however, it doesn’t consider whether the correct sample size was evaluated to determine if the results are statistically significant.

    For example, let’s use the following inputs:

    Number of visitors – Control = 1000, Variant = 1000
    Number of conversions – Control = 10, Variant = 25

    If we plug these numbers into the calculator, we’ll see that it meets the 95% confidence interval. The issue is that the sample size needed to detect a lift of 1.5 percentage points with a base conversion rate of 1% (10 conversions / 1000 visitors) would be nearly 70,000.

    Just because a confidence interval was met does NOT mean that sample size is large enough to be statistically significant. This is a huge misunderstanding in the A/B community and needs to be called out.

    • Brian Massey says:

      Colton, you are CORRECT. We would never rely on an AB test with only 35 conversions.

      In this case, we can look at the POWER of the test. Here’s another good tool that calculates the power for you.

      The example you present shows a 150% increase, a change so significant that it has a power of 97%.

      So, as a business owner, can I accept the risk that this is a false positive (<3%) or will I let the test run 700% longer to hit pure statistical significance?


