CRO Course Level 3

We all make decisions every day based on what other people are doing. You are wired to navigate the world using behavioral data.

When you check Facebook to see how many people like and comment on your most recent post, you’re using your built-in behavioral know-how. When you select a movie based on the Rotten Tomatoes Freshness Score, you are getting your behavioral science on. The New York Times Best Seller list, the Billboard Charts, and the laugh track on The Big Bang Theory are all sources of behavioral data that we use to make decisions every day.

If you don’t believe me, let’s use an example. When my son was 14, he built his own gaming computer. He had meticulously researched every component, from the high-frequency monitor to the mouse pad. His last decision was the motherboard, the foundation of the computer that every element plugs into.

He had narrowed it down to two alternatives. They had the same features and were priced within pennies of each other. Reviewers of the motherboards had given one a four-star rating, and the other a five-star rating.

If we didn’t understand the first rule of behavioral data, we would have simply chosen the five-star motherboard. Five stars is better than four, right? But even at the tender age of 14, Sean was smart enough to see how many reviews had fed those ratings.

Two products with similar features and price. You know which rating to believe.

The five-star motherboard had five reviews, while the lower-rated four-star motherboard had 250 reviews. You have no doubt about which rating is more reliable. Your little brain, like my son’s, is doing the math. With only five reviews behind it, the five-star rating could just as easily turn out to be a three- or two-star rating.

The data isn’t in.

You are intuitively calculating what statisticians call n: The Sample Size of the data collected (the reviews). And you know the first rule of behavioral data.

1. Larger Sample Sizes Are Better Than Smaller Sample Sizes

It is rarely feasible to ask every single person who would buy from us what it would take to get them to give us money. Instead, we ask a sample of our audience what they think. If our sample is big enough, we can assume that our entire audience will feel the same way.

The larger the sample size we can generate–little n–the more accurately we can predict how our campaigns and websites will perform.
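To put rough numbers on that intuition, here is a minimal Python sketch of how the uncertainty around an average rating shrinks as n grows. It uses the textbook margin-of-error approximation for a sample mean. The one-star spread between individual ratings is our assumption, not data from the motherboard reviews, and the function name `margin_of_error` is ours.

```python
import math

def margin_of_error(std_dev: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample mean (normal approximation)."""
    return z * std_dev / math.sqrt(n)

# Assumption for illustration: individual ratings vary by about one star.
ASSUMED_STD_DEV = 1.0

few_reviews = margin_of_error(ASSUMED_STD_DEV, 5)     # the five-star board
many_reviews = margin_of_error(ASSUMED_STD_DEV, 250)  # the four-star board

print(f"n=5:   +/- {few_reviews:.2f} stars")
print(f"n=250: +/- {many_reviews:.2f} stars")
```

Under these assumptions, five reviews leave the average uncertain by close to a full star, while 250 reviews narrow it to roughly a tenth of a star. The normal approximation is crude for a sample as small as five, so treat the first figure as a rough indication only.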

This is why a “launch and see” approach is so appealing. We feel that we need to launch something to reach a large sample size of potential buyers. Or, we decide to rely on experts to make good decisions about what we should create to sell our businesses.

Campaign development often starts with a creative director. In this case, n=1. If this person gets input from their team, we are getting insights from a handful of people. Our sample size might be five or ten. If we run a focus group or survey, we can get the input from a dozen people or more. Little n might be as high as 20.

These sample sizes are not enough to predict the future statistically. You can see why advertising and websites designed by small teams can fail. The number of people involved in the research is small.

When we collect online analytics, we are involving hundreds or thousands of people in our development process. Little n is much, much larger. As we all know intuitively, this means the data is more reliable, like the number of reviews of our motherboards.

We are also calculating another statistical value when we look at these ratings and reviews: the total population, or N.

2. Data Over Time Is Better Than Data At One Point In Time

When we consider our two motherboards, and look at the sample sizes, we will naturally infer how many of each had been sold. The 5-star product only has five reviews. Either it had not been on the market long, or it just wasn’t selling well. As a result, we will assume that the population of buyers–N–is small, and that the time over which these reviews were collected was small.

We know intuitively that data collected over a long time is better than data collected over a short time. Things change over time. Even within a week, people buy more or less on weekdays than on weekends.

When we run a focus group, launch a survey, or do a marketing study, we’re measuring our audience at one point in time. If you survey swimmers about their preferences for beachwear in January, you might get very different results than if you asked them in July.

When data is cheap, we can measure it year round, all the time. Behavioral data can be collected constantly on our digital properties. With just a few lines of code on your website, your analytics software builds a very helpful behavioral database day and night. Once you have this database, you can decide what part of the year you want to examine. Or use the entire year.

Our friends at Decorview sell high-end window treatments from companies like Hunter Douglas. One might assume that the people that buy luxury household items like this would not be too price sensitive. We found some data that told us the opposite.

When we examined the search ads that Decorview had run, we found that ads featuring discounts were far and away the most clicked. These ad campaigns had been running for months and years, so we tended to trust the data.

Ads featuring discounts generated more clicks for high-end window treatments.

We changed the landing page to feature discounts and saw a 40% increase in leads from an AB test.

Advertising data collected over time helped us create a high-converting landing page.

3. More-Recent Data Is Better Than Less-Recent Data

If, in fact, things and people change over time, then we would tend to trust more recent data over old data. This is also why we may retest something we already collected data on last year.

Traditionally, market research has taken time and effort. Most marketing studies were months old before they were applied to a campaign. As the time and cost of research have dropped, we perform studies more often and with more precision.

There is little good reason to use stale data when it is so easy to collect it fresh from the farm.

The personalized children’s book The Little Boy Who Lost His Name had sold a half-million copies, and the publisher had high hopes for the next installment, The Incredible Intergalactic Journey Home.

The Little Boy Who Lost His Name had sold over 500,000 copies. Source: UsabilityHub.com

Unfortunately, The Incredible Intergalactic Journey Home didn’t sell nearly as well. The past had not repeated itself.

Some alternative covers were developed and data collected through UsabilityHub.

Courtesy: UsabilityHub.com

The newly designed cover immediately improved sales. Things had changed since the first book came out, and fresh data was needed.

4. Observational Data Is Better Than Self-Reported Data

When we ask a survey panel or focus group what they think of our creative, they will lie. Humans are very good at rationalizing their decisions, but few know the real psychological reasons why they act the way they do.

Behavioral input, on the other hand, is an observation of people as they act. We don’t necessarily have to ask them why they do something. We can watch.

The classic manifestation of this is the derided popup window. Despised by nearly everyone you ask, these little windows will reliably increase leads and subscribers in almost every situation, especially exit-intent popovers. The self-reported data clearly contradicts the observational data.

5. Customers & Prospects Are More Believable Than Pretenders

Yelp has come under fire in recent years because of fake reviews. Businesses were hiring people to write glowing reviews about them. It turns out that advertisers want to take advantage of your natural behavioral science abilities to trick you with bad data.

When we see a brand with tens of thousands of likes on Facebook, we take it with a grain of salt. It’s easy to like something. That doesn’t mean these likes came from customers.

When launching surveys, taste tests and focus groups, we want to get subjects that are as much like our customers as possible, but ultimately, it is unlikely they are searching for our product at the time we are asking them their opinion.

This is why keyword advertising is better than display ads. We’re speaking to people who are more likely looking for what we offer.

Behavioral data, by definition, is gathered from the activities of prospects and customers as they interact with our digital properties, our products and our services. They wouldn’t be there otherwise. We can trust that the larger population of prospects will behave similarly.

6. Quantitative Data Is More Reliable Than Qualitative Data

When you look at the star rating for a product and the number of reviews, you are using quantitative data to guide your decision.

When you read the reviews, you are using qualitative data.

Both can be helpful, but only one will predict the future reliably.

Qualitative collection methods allow us to drill down with a few subjects to understand more about their emotions and motivations. We don’t know if these emotions and motivations are representative of the broader market.

Quantitative data gives us more statistical confidence that what we are seeing represents the larger market. However, this data isn’t seasoned with human input. Both are important.

The two can be used hand-in-hand. It’s the quantitative data that ultimately wins the day.

The company Automatic sells a device that plugs into most modern cars and connects a car’s computer to your phone. This connection gives drivers the data they need to maintain their automobiles and become better drivers. Automatic launched a “Pro” version of their product that didn’t require your phone to connect to the internet. It had its own 3G connection.

The feature comparison chart that caused more buyers to choose Automatic Lite.

Yet, most people were buying the Lite version. We wanted to find out why.

We asked buyers why they chose the Lite version in a popup survey on the receipt page. We got a lot of feedback, but this comment summed it up best:

A thank-you page survey asked, "What made you choose Automatic Lite over Pro?"

We didn’t stop with this input. We removed the confusing features from the feature list, and designed an AB test to collect some observational data.

Our AB test tested a shorter feature list, eliminating confusing features.

In an AB test, the visitors don’t even realize that they are being tested. We are simply observing the results of their interactions. In this case, our changes increased conversion rates, and increased sales of the Pro unit as a percentage of overall sales.

Final Thoughts

There has been a lot of focus on AB testing as a marketing tool in recent years. This is because AB tests are designed to follow all of the rules of behavioral data. They are designed to deliver observational, recent data, taken over time from a statistically significant sample of prospects that can be quantitatively analyzed.

As a marketer, you can tap into this innate scientific know-how, using it to predict the performance of your campaigns and make them better.


The AB test results had come in, and the result was inconclusive. The Conversion Sciences team was disappointed. They thought the change would increase revenue. What they didn’t know was that the top-level results were lying.

While we can learn something from inconclusive tests, it’s the winners that we love. Winners increase revenue, and that feels good.

The team looked closer at our results. When a test concludes, we analyze the results in analytics to see if there is any more we can learn. We call this post-test analysis.

Isolating the segment of traffic that saw test variation A, it was clear that one browser had under-performed the others: Internet Explorer.

Performance of Variation A. Internet Explorer visitors significantly underperformed the other three popular browsers.

The visitors coming on Internet Explorer were converting at less than half the average of the other browsers and generating one-third the revenue per session. This was not true of the Control. Something was wrong with this test variation. Despite a vigorous QA effort that included all popular browsers, an error had been introduced into the test code.

Analysis showed that correcting this would deliver a 13% increase in conversion rate and 19% increase in per session value. And we would have a winning test after all.

Conversion Sciences has a rigorous QA process to ensure that errors like this are very rare, but they happen. And they may be happening to you.

Post-test analysis keeps us from making bad decisions when the unexpected rears its ugly head. Here’s a primer on how conversion experts ensure they are making the right decisions by doing post-test analysis.

Did Any Of Our Test Variations Win?

The first question that will be on our lips is, “Did any of our variations win?”

There are two possible outcomes when we examine the results of an AB test.

  1. The test was inconclusive. None of the alternatives beat the control. The null hypothesis was not rejected.
  2. One or more of the treatments beat the control in a statistically significant way.
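As a rough illustration of what “statistically significant” means here, the following Python sketch runs a two-proportion z-test on the conversion counts of a control and one treatment. This is one common frequentist check, not necessarily the method your testing tool uses. The function name and the visitor and conversion counts are hypothetical, invented only for the example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: control converts 200 of 10,000, treatment 250 of 10,000.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
verdict = "treatment beat control" if p < 0.05 and z > 0 else "inconclusive"
print(f"z = {z:.2f}, p = {p:.4f} -> {verdict}")
```

The 0.05 threshold is a convention, not a law. A p-value above it puts the test in the first bucket (inconclusive), and one below it, with the treatment ahead, puts it in the second.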

Joel Harvey of Conversion Sciences describes his process below:

[su_note note_color=”#dcf0df” text_color=”#000000″ radius=”10″]

Joel Harvey, Conversion Sciences

“Post-test analysis” is sort of a misnomer. A lot of analytics happens in the initial setup and throughout the full AB testing process. The “post-test” insights derived from one batch of tests are the “pre-test” analytics for the next batch, and the best way to have good goals for that next batch of tests is to set the right goals during your previous split tests.

That said, when you look at the results of an AB testing round, the first thing you need to look at is whether the test was a loser, a winner, or inconclusive.

Verify that the winners were indeed winners. Look at all the core criteria: statistical significance, p-value, test length, delta size, etc. If it checks out, then the next step is to show it to 100% of traffic and look for that real-world conversion lift.

In a perfect world you could just roll it out for 2 weeks and wait, but usually, you are jumping right into creating new hypotheses and running new tests, so you have to find a balance.

Once we’ve identified the winners, it’s important to dive into segments.

  • Mobile versus non-mobile
  • Paid versus unpaid
  • Different browsers and devices
  • Different traffic channels
  • New versus returning visitors (important to set up and integrate this beforehand)

This is fairly easy to do with enterprise tools, but might require some more effort with less robust testing tools. It’s important to have a deep understanding of how tested pages performed with each segment. What’s the bounce rate? What’s the exit rate? Did we fundamentally change the way this segment is flowing through the funnel?

We want to look at this data in full, but it’s also good to remove outliers falling outside two standard deviations of the mean and re-evaluate the data.

It’s also important to pay attention to lead quality. The longer the lead cycle, the more difficult this is. In a perfect world, you can integrate the CRM, but in reality, this often doesn’t work very seamlessly.

[/su_note]
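The outlier step Joel Harvey mentions can be sketched in a few lines of Python. This is a minimal illustration of trimming values that fall more than two standard deviations from the mean. The function name and the order values are hypothetical.

```python
import statistics

def trim_outliers(values, num_sd=2.0):
    """Keep only values within num_sd standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= num_sd * sd]

# Hypothetical order values for one variation, with one very large order.
order_values = [40, 52, 47, 55, 49, 51, 46, 53, 48, 900]

trimmed = trim_outliers(order_values)
print("Mean with outliers:   ", round(statistics.mean(order_values), 2))
print("Mean without outliers:", round(statistics.mean(trimmed), 2))
```

Comparing the two means side by side, rather than discarding the original data, matches the advice above: look at the data in full, then re-evaluate without the extremes.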

Chris McCormick, Head of Optimisation at PRWD, describes his process:

[su_note note_color=”#dcf0df” text_color=”#000000″ radius=”10″]

Chris McCormick, PRWD

When a test concludes, we always use the testing tool as a guide but we would never hang our hat on that data. We always analyse results further within Google Analytics, as this is the purest form of data.

For any test, we always set out at the start what our ‘primary success metrics’ are. These are what we look to identify first via GA and what we communicate as a priority to the client. Once we have a high level understanding of how the test has performed, we start to dig below the surface to understand if there are any patterns or trends occurring. Examples of this would be: the day of the week, different product sets, new vs returning users, desktop vs mobile etc.

We always look to report on a rough ROI figure for any test we deliver, too. In most cases, I would look to do this based on taking data from the previous 12 months and applying whatever the lift was to that. This is always communicated to the client as a ballpark figure, e.g. circa £50k ROI. The reason for this is that there are so many additional/external influences on a test that we can never be 100% accurate; testing is not an exact science and shouldn’t be treated as such.

[/su_note]

Are We Making Type I or Type II errors?

In our post on AB testing statistics, we discussed type I and type II errors. We work to avoid these errors at all costs.

To avoid errors in judgement, we verify the results of our testing tool against our analytics. It is very important that our testing tool send data to our analytics package telling us which variations are seen by which segments of visitors.

Our testing tools only deliver top-level results, and we’ve seen that technical errors happen. So we reproduce the results of our AB test using analytics data.

Did each variation get the same number of conversions? Was revenue reported accurately?
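One way to make that comparison routine is to line up the counts from both sources and flag any variation where they disagree by more than a small tolerance. The Python sketch below assumes you have already exported conversion counts per variation from each tool. The 5% tolerance, the function name, and the numbers are all hypothetical.

```python
def find_discrepancies(tool_counts, analytics_counts, tolerance=0.05):
    """Return variations whose counts differ by more than `tolerance` (relative)."""
    flagged = {}
    for variation, tool_value in tool_counts.items():
        analytics_value = analytics_counts.get(variation, 0)
        baseline = max(tool_value, analytics_value)
        if baseline == 0:
            continue  # nothing recorded in either source
        relative_gap = abs(tool_value - analytics_value) / baseline
        if relative_gap > tolerance:
            flagged[variation] = round(relative_gap, 3)
    return flagged

# Hypothetical conversion counts exported from each source.
testing_tool = {"control": 412, "variation_a": 398}
analytics = {"control": 405, "variation_a": 310}

flagged = find_discrepancies(testing_tool, analytics)
print(flagged)
```

Small gaps between tools are normal, since they count sessions and conversions differently. A large gap on a single variation is a hint that something in the test setup or tracking deserves a closer look.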

Errors are best avoided by ensuring the sample size is large enough and utilizing a proper AB testing framework. Peep Laja describes his process below:

[su_note note_color=”#dcf0df” text_color=”#000000″ radius=”10″]

Peep Laja, ConversionXL

First of all I check whether there’s enough sample size and that we can trust the outcome of the test. I check if the numbers reported by the testing tool line up with the analytics tool, both for CR (conversion rate) and RPV (revenue per visit).

In the analytics tool I try to understand how the variations changed user behavior – by looking at microconversions (cart adds, certain page visits etc) and other stats like cart value, average qty per purchase etc.

If the sample size is large enough, I want to see the results of the test across key segments (provided that the results in the segments are valid, have enough volume etc), and see if the treatments performed better/worse inside the segments. Maybe there’s a case for personalization there. The segments I look at are device split (if the test was run across multiple device categories), new/returning, traffic source, first time buyer / repeat buyer.

[/su_note]

How Did Key Segments Perform?

In the case of an inconclusive test, we want to look at individual segments of traffic.

For example, we have had an inconclusive test on smartphone traffic in which the Android visitors loved our variation, but iOS visitors hated it. They cancelled each other out. Yet we would have missed an important piece of information had we not looked more closely.

Visitors react differently depending on their device, browser and operating system.
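The Android and iOS example can be reproduced with a few lines of Python. The sketch below computes the lift overall and per segment from conversions and sessions. The numbers are invented to show two segments cancelling each other out and are not the data from the test described above.

```python
# Hypothetical (conversions, sessions) per segment and test arm.
results = {
    ("android", "control"): (100, 5000),
    ("android", "variation"): (130, 5000),
    ("ios", "control"): (150, 5000),
    ("ios", "variation"): (120, 5000),
}

def relative_lift(control, variation):
    """Relative change in conversion rate from control to variation."""
    control_rate = control[0] / control[1]
    variation_rate = variation[0] / variation[1]
    return (variation_rate - control_rate) / control_rate

def lift_by_segment(results):
    segments = sorted({segment for segment, _ in results})
    return {
        s: relative_lift(results[(s, "control")], results[(s, "variation")])
        for s in segments
    }

def overall_lift(results):
    totals = {"control": [0, 0], "variation": [0, 0]}
    for (_, arm), (conversions, sessions) in results.items():
        totals[arm][0] += conversions
        totals[arm][1] += sessions
    return relative_lift(totals["control"], totals["variation"])

overall = overall_lift(results)
segment_lifts = lift_by_segment(results)
print(f"Overall lift: {overall:+.0%}")
for segment, lift in segment_lifts.items():
    print(f"{segment}: {lift:+.0%}")
```

With these made-up numbers the top-level result shows no lift at all, while one segment is clearly up and the other clearly down. Each segment has fewer sessions than the whole test, so segment-level differences need their own significance check before you act on them.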

Other segments that may perform differently may include:

  1. Return visitors vs. New visitors
  2. Chrome browsers vs. Safari browsers vs. Internet Explorer vs. …
  3. Organic traffic vs. paid traffic vs. referral traffic
  4. Email traffic vs. social media traffic
  5. Buyers of premium products vs. non-premium buyers
  6. Home page visitors vs. internal entrants

These segments will be different for each business, but provide insights that spawn new hypotheses, or even provide ways to personalize the experience.

Understanding how different segments are behaving is fundamental to good testing analysis, but it’s also important to keep the main thing the main thing, as Rich Page explains:

[su_note note_color=”#dcf0df” text_color=”#000000″ radius=”10″]

Rich Page, Website Optimizer

Avoid analysis paralysis. Don’t slice the results into too many segments or different analytics tools. You may often run into conflicting findings. Revenue should always be considered the best metric to pay attention to other than conversion rate. After all, what good is a result with a conversion lift if it doesn’t also increase revenue?

The key thing is not to throw out A/B tests that have inconclusive results, as this will happen quite often. This is a great opportunity to learn and create a better follow up A/B test. In particular you should gain visitor feedback regarding the page being A/B tested, and show them your variations – this helps reveal great insights into what they like and don’t like. Reviewing related visitor recordings and click maps also gives good insights.

[/su_note]

Nick So of WiderFunnel talks about segments as well within his own process for AB test analysis:

[su_note note_color=”#dcf0df” text_color=”#000000″ radius=”10″]

Nick So, WiderFunnel

“Besides the standard click-through rate, funnel drop-off, and conversion rate reports for post-test analysis, most of the additional reports and segments I pull are very dependent on the business context of a website’s visitors and customers.

For an ecommerce site that does a lot of email marketing and has high return buyers, I look at the difference in source traffic as well as new versus returning visitors. Discrepancies in behavior between segments can provide insights for future strategies, where you may want to focus on the behaviors of a particular segment in order to get that additional lift.

Sometimes, just for my own personal geeky curiosity, I look into seemingly random metrics to see if there are any unexpected patterns. But be warned: it’s easy to get too deep into that rabbit hole of slicing and dicing the data every which way to find some sort of pattern.

For lead-gen and B2B companies, you definitely want to look at the full buyer cycle and LTV of your visitors in order to determine the true winner of any experiment. Time and time again, I have seen tests that successfully increase lead submissions, only to discover that the quality of the leads coming through is drastically lower, which could cost a business MORE money in funnelling sales resources to unqualified leads.

In terms of post-test results analysis and validation, besides whatever statistical method your testing tool uses, I always run results through WiderFunnel’s internal results calculator, which utilizes Bayesian statistics to provide the risk and reward potential of each test. This allows you to make a more informed business decision, rather than simply a win/loss, significant/not significant recommendation.”

[/su_note]
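We don’t know how WiderFunnel’s internal calculator works, so the Python sketch below is only a generic illustration of the Bayesian idea Nick So describes. It assumes a uniform Beta(1, 1) prior on each conversion rate and uses random draws from the resulting posteriors to estimate the probability that the variation beats the control. The function name, the prior, and the counts are our assumptions.

```python
import random

def probability_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_b > rate_a) under Beta(1, 1) priors."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

# Hypothetical counts: control 200 of 10,000, variation 250 of 10,000.
chance_to_win = probability_b_beats_a(200, 10_000, 250, 10_000)
print(f"Probability the variation beats the control: {chance_to_win:.1%}")
```

A result expressed as “chance to win” is easier to weigh against the cost of being wrong than a bare significant/not significant label, which is the kind of business decision described in the quote.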

In addition to understanding how tested changes impacted each segment, it’s also useful to understand where in the customer journey those changes had the greatest impact, as Benjamin Cozon describes:

[su_note note_color=”#dcf0df” text_color=”#000000″ radius=”10″]

Benjamin Cozon, Uptilab

We need to consider that the end of the running phase of a test is actually the beginning of insight analysis.

Why is each variation delivering a particular conversion rate? In which cases are my variations making a difference, whether positive or negative? In order to better understand the answers to these questions, we always try to identify which user segments are the most elastic to the changes that were made.

One way we do it is by breaking down (“ventilating”) the data by session-based or user-based dimensions. Here are some of the dimensions we use for almost every test:

  • User type (new / returning)
  • Prospect / new Client / returning client
  • Acquisition channel
  • Type of landing page

This type of breakdown helps us understand the impact of specific changes for users relative to their specific place in the customer journey. Having these additional insights also helps us build a strong knowledge base and communicate effectively throughout the organization.

[/su_note]

Finally, while it is a great idea to have a rigorous quality assurance (QA) process for your tests, some may slip through the cracks. When you examine segments of your traffic, you may find one segment that performed very poorly. This may be a sign that the experience they saw was broken.

It is not unusual to see visitors using Internet Explorer crash and burn since developers abhor making customizations for that non-compliant browser.

How Did Changes Affect Lead Quality?

Post-test analysis allows us to be sure that the quality of our conversions is high. It’s easy to increase conversions. But are these new conversions buying as much as the ones who saw the control?

Several of Conversion Sciences’ clients prize phone calls, and the company optimizes for them. Each week, the calls are examined to ensure the callers are qualified to buy and truly interested in a solution.

In post-test analysis, we can examine the average order value for each variation to see if buyers were buying as much as before.

We can look at the profit margins generated for the products purchased. If revenue per visit rose, did profit follow suit?

Marshall Downey of Build.com has some more ideas for us in the following instagraph infographic.

Post-test analysis Instagraph by Marshall Downey.

Revenue is often looked to as the pre-eminent judge of lead quality, but doing so comes with its own pitfalls, as Ben Jesson describes in his approach to AB test analysis.

[su_note note_color=”#dcf0df” text_color=”#000000″ radius=”10″]

Ben Jesson, Conversion Rate Experts

If a test doesn’t reach significance, we quickly move on to the next big idea. There are limited gains to be had from adding complexity by promoting narrow segments.

It can be priceless to run on-page surveys on the winning page, to identify opportunities for improving it further. Qualaroo and Hotjar are great for this.

Lead quality is important, and we like to tackle it from two sides. First, qualitatively: Does the challenger page do anything that is likely to reduce or increase the lead value? Second, quantitatively: How can we track leads through to the bank, so we can ensure that we’ve grown the bottom line?

You might expect that it’s better to measure revenue than to measure the number of orders. However, statistically speaking, this is often not true. A handful of random large orders can greatly skew the revenue figures. Some people recommend manually removing the outliers, but that only acknowledges the method’s intrinsic problem. How do you define outlier, and why aren’t we interested in them? If your challenger hasn’t done anything that is likely to affect the order size, then you can save time by using the number of conversions as the goal.

After every winning experiment, record the results in a database that’s segmented by industry sector, type of website, geographic location, and conversion goal. We have been doing this for a decade, and the value it brings to projects is priceless.

[/su_note]

Analyze AB Test Results by Time and Geography

Conversion quality is important, and Theresa Baiocco takes this one step further.

[su_note note_color=”#dcf0df” text_color=”#000000″ radius=”10″]

Theresa Baiocco, Conversion Max

For lead gen companies with a primary conversion goal of a phone call, it’s not enough to optimize for quantity of calls; you have to track and improve call quality. And if you’re running paid ads to get those phone calls, you need to incorporate your cost to acquire a high-quality phone call, segmented by:

  • Hour of day
  • Day of week
  • Ad position
  • Geographic location, etc

When testing for phone calls, you have to compare the data from your call tracking software with the data from your advertising. For example, if you want to know which day of the week your cost for a 5-star call is lowest, you first pull a report from your call tracking software on 5-star calls by day of week:

Report: 5-star calls by day of week.

Then, check data from your advertising source, like Google AdWords. Pull a report of your cost by day of week for the same time period:

Report: advertising cost by day of week.

Then, you simply divide the amount you spent by the number of 5-star calls you got, to find out how much it costs to generate a 5-star call each day of the week.

Report: cost per 5-star call by day of week.

Repeat the process on other segments, such as hour of day, ad position, week of the month, geographic location, etc. By doing this extra analysis, you can shift your advertising budget to the days, times, and locations when you generate the highest quality of phone calls – for less.

[/su_note]
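The division Theresa Baiocco describes is simple enough to do in a spreadsheet, but here is a short Python sketch of the same calculation. The spend and call figures are invented for illustration, and the function name is ours. In practice the two inputs would come from your advertising reports and your call tracking software.

```python
def cost_per_quality_call(spend_by_day, calls_by_day):
    """Divide ad spend by the number of 5-star calls for each day of the week."""
    costs = {}
    for day, spend in spend_by_day.items():
        calls = calls_by_day.get(day, 0)
        # No 5-star calls means the cost per call is undefined for that day.
        costs[day] = spend / calls if calls else None
    return costs

# Hypothetical figures for the same time period.
ad_spend = {"Mon": 500.0, "Tue": 450.0, "Wed": 300.0}
five_star_calls = {"Mon": 10, "Tue": 5, "Wed": 0}

costs = cost_per_quality_call(ad_spend, five_star_calls)
cheapest_day = min(
    (day for day, cost in costs.items() if cost is not None),
    key=lambda day: costs[day],
)
print(costs)
print("Cheapest day for a 5-star call:", cheapest_day)
```

The same function works for any other segment, such as hour of day or geographic location, as long as both reports are keyed the same way.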

Look for Unexpected Effects

Results aren’t derived in a vacuum. Any change will create ripple effects throughout a website, and some of these effects are easy to miss.

Craig Andrews gives us insight into this phenomenon via a recent discovery he made with a new client:

[su_note note_color=”#dcf0df” text_color=”#000000″ radius=”10″]

Craig Andrews, allies4me

I stumbled across something last week – and I almost missed it because it was a secondary effect of a campaign I was running. One weakness of CRO, in my honest opinion, is the transactional focus of the practice. CRO doesn’t have a good way of measuring follow-on effects.

For example, I absolutely believe pop-ups increase conversions, but at what cost? How does it impact future engagement with the brand? If you are selling commodities, then it probably isn’t a big concern. But most people want to build brand trust & brand loyalty.

We discovered a shocking level of re-engagement with content based on the quality of a visitor’s first engagement. I probably wouldn’t believe it if I hadn’t seen it personally and double-checked the analytics. In the process of doing some general reporting, we discovered that we radically increased the conversion rates of the 2 leading landing pages as secondary effects of the initial effort.

We launched a piece of content that we helped the client develop. It was a new client and the development of this content was a little painful with many iterations as everyone wanted to weigh in on it. One of our biggest challenges was getting the client to agree to change the voice & tone of the piece – to use shorter words & shorter sentences. They were used to writing in a particular way and were afraid that their prospects wouldn’t trust & respect them if they didn’t write in a highbrow academic way.

We completed the piece, created a landing page and promoted the piece primarily via email to their existing list. We didn’t promote any other piece of content all month. They had several pieces (with landing pages) that had been up all year.

It was a big success. It was the most downloaded piece of content for the entire year. It had more downloads in one month than any other piece had in total for the entire year. Actually, 28% more downloads than #2 which had been up since January.

But then, I discovered something else…

The next 2 most downloaded pieces of content spiked in October. In fact, 50% of the total year’s downloads for those pieces happened in October. I thought it may be a product of more traffic & more eyeballs. Yes that helped, but it was more than that. The conversion rates for those 2 landing pages increased 160% & 280% respectively!

We did nothing to those landing pages. We didn’t promote that content. We changed nothing except the quality of the first piece of content that we sent out in our email campaign.

Better writing increased the brand equity for this client and increased the demand for all other content.

[/su_note]

Testing results can also be compared against an archive of past results, as Shanelle Mullin discusses here:

[su_note note_color=”#dcf0df” text_color=”#000000″ radius=”10″]

Shanelle Mullin, ConversionXL

There are two benefits to archiving your old test results properly. The first is that you’ll have a clear performance trail, which is important for communicating with clients and stakeholders. The second is that you can use past learnings to develop better test ideas in the future and, essentially, foster evolutionary learning.

The clearer you can communicate the ROI of your testing program to stakeholders and clients, the better. It means more buy-in and bigger budgets.

You can archive your test results in a few different ways. Tools like Projects and Effective Experiments can help, but some people use plain ol’ Excel to archive their tests. There’s no single best way to do it.

What’s really important is the information you record. You should include: the experiment date, the audience / URL, screenshots, the hypothesis, the results, any validity factors to consider (e.g. a PR campaign was running, it was mid-December), a link to the experiment, a link to a CSV of the results, and insights gained.

[/su_note]
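The fields Shanelle lists map naturally onto one row per experiment. Here is a minimal sketch of such an archive in Python. The class name, field names, and the example values are all my own assumptions for illustration; they are not taken from any particular archiving tool.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class ExperimentRecord:
    """One archived experiment. Field names are illustrative assumptions."""
    experiment_date: str
    audience_or_url: str
    screenshots: str
    hypothesis: str
    results: str
    validity_factors: str
    experiment_link: str
    results_csv_link: str
    insights: str


def to_csv(records):
    """Serialize a list of ExperimentRecord objects to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(ExperimentRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Hypothetical example values, for illustration only.
example = ExperimentRecord(
    experiment_date="2016-12-14",
    audience_or_url="/pricing (all visitors)",
    screenshots="screens/pricing-a.png; screens/pricing-b.png",
    hypothesis="A shorter form will increase signups",
    results="Variation B won",
    validity_factors="Ran mid-December; PR campaign was live",
    experiment_link="https://example.com/experiments/42",
    results_csv_link="https://example.com/experiments/42.csv",
    insights="Visitors abandon at the phone number field",
)
print(to_csv([example]))
```

A plain CSV like this opens in Excel, which fits the point above that there is no single best way to archive tests.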

Why Did We Get The Result We Got?

Ultimately, we want to answer the question, “Why?” Why did one variation win and what does it tell us about our visitors?

This is a collaborative process and speculative in nature. Asking why has two primary effects:

  1. It develops new hypotheses for testing
  2. It causes us to rearrange the hypothesis list based on new information

Our goal is to learn as we test, and asking “Why?” is the best way to cement our learnings.


A/B testing statistics made simple. A guide that will clear up some of the more confusing concepts while providing you with a solid framework to AB test effectively.

Here’s the deal. You simply cannot A/B test effectively without a sound understanding of A/B testing statistics.

And while there has been a lot of exceptional content written on AB testing statistics, I’ve found that most of these articles are either overly simplistic or they get very complex without anchoring each concept to a bigger picture.

Today, I’m going to explain the statistics of AB testing within a linear, easy-to-follow narrative. It will cover everything you need to use AB testing software effectively.

You might have been told that plugging a few numbers into a statistical significance calculator is enough to validate a test. Or perhaps you see the green “test is significant” checkmark popup on your testing dashboard and immediately begin preparing the success reports for your boss.

In other words, you might know just enough about split testing statistics to dupe yourself into making major errors, and that’s exactly what I’m hoping to save you from today.

Here’s my best attempt at making statistics intuitive.

Why Statistics Are So Important To A/B Testing

The first question that has to be asked is “Why are statistics important to AB testing?”

The answer to that question is that AB testing is inherently a statistics-based process. The two are inseparable from each other.

An AB test is an example of statistical hypothesis testing, a process whereby a hypothesis is made about the relationship between two data sets and those data sets are then compared against each other to determine if there is a statistically significant relationship or not.

To put this in more practical terms, a prediction is made that Page Variation #B will perform better than Page Variation #A, and then data sets from both pages are observed and compared to determine if Page Variation #B is a statistically significant improvement over Page Variation #A.

This process is an example of statistical hypothesis testing.

But that’s not the whole story. The point of AB testing has absolutely nothing to do with how variations #A or #B perform. We don’t care about that.

What we care about is how our page will ultimately perform with our entire audience.

And from this bird’s-eye view, the answer to our original question is that statistical analysis is our best tool for predicting outcomes we don’t know using information we do know.

For example, we have no way of knowing with 100% accuracy how the next 100,000 people who visit our website will behave. That is information we cannot know today, and if we were to wait until those 100,000 people visited our site, it would be too late to optimize their experience.

What we can do is observe the next 1,000 people who visit our site and then use statistical analysis to predict how the following 99,000 will behave.

If we set things up properly, we can make that prediction with incredible accuracy, which allows us to optimize how we interact with those 99,000 visitors. This is why AB testing can be so valuable to businesses.

In short, statistical analysis allows us to use information we know to predict outcomes we don’t know with a reasonable level of accuracy.

The Complexities Of Sampling, Simplified

That seems fairly straightforward, so where does it get complicated?

The complexities arrive in all the ways a given “sample” can inaccurately represent the overall “population”, and all the things we have to do to ensure that our sample can accurately represent the population.

Let’s define some terminology real quick.

The “population” is the group we want information about. It’s the next 100,000 visitors in my previous example. When we’re testing a webpage, the true population is every future individual who will visit that page.

The “sample” is a small portion of the larger population. It’s the first 1,000 visitors we observe in my previous example.

In a perfect world, the sample would be 100% representative of the overall population.

For example:

Let’s say 10,000 out of those 100,000 visitors are going to ultimately convert into sales. Our true conversion rate would then be 10%.

In a tester’s perfect world, the mean (average) conversion rate of any sample(s) we select from the population would always be identical to the population’s true conversion rate. In other words, if you selected a sample of 10 visitors, 1 of them (10%) would buy, and if you selected a sample of 100 visitors, then 10 would buy.

But that’s not how things work in real life.

In real life, you might have only 2 out of the first 100 buy or you might have 20… or even zero. You could have a single purchase from Monday through Friday and then 30 on Saturday.
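To see this for yourself, here is a small simulation sketch in Python. It assumes the 10% true conversion rate from the example and draws 20 samples of 100 visitors each. The seed and the helper name are arbitrary choices of mine.

```python
import random


def sample_conversions(true_rate, sample_size, rng):
    """Count how many of `sample_size` simulated visitors convert."""
    return sum(1 for _ in range(sample_size) if rng.random() < true_rate)


rng = random.Random(42)  # fixed seed so the run is repeatable
TRUE_RATE = 0.10         # the 10% true conversion rate from the example

counts = [sample_conversions(TRUE_RATE, 100, rng) for _ in range(20)]
print("Conversions in 20 samples of 100 visitors:", counts)
print("Lowest:", min(counts), "Highest:", max(counts))
```

Every sample comes from the same population, yet the counts rarely land on exactly 10.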

This variability across samples is measured by a statistic called the “variance”, which describes how far a random sample can differ from the true mean (average).

The Freakonomics podcast makes an excellent point about what “random” really is. If you have one person flip a coin 100 times, you would have a random list of heads or tails with a high variance.

If we write these results down, we would expect to see several examples of long streaks, five or seven or even ten heads in a row. When we think of randomness, we imagine that these streaks would be rare. Statistically, they are quite possible in such a dataset with high variance.

The higher the variance, the more variable the mean will be across samples. Variance is, in some ways, the reason statistical analysis isn’t a simple process. It’s the reason I need to write an article like this in the first place.

So it would not be impossible to take a sample of ten results that contain one of these streaks. This would certainly not be representative of the entire 100 flips of the coin, however.
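If the idea of long streaks in random data feels wrong, a quick simulation can help. This sketch flips a simulated fair coin 100 times, repeats that 1,000 times, and records the longest streak in each run. The seed and the threshold of five are my own choices for illustration.

```python
import random


def longest_streak(flips):
    """Return the length of the longest run of identical consecutive values."""
    if not flips:
        return 0
    longest = current = 1
    for previous, this in zip(flips, flips[1:]):
        current = current + 1 if this == previous else 1
        longest = max(longest, current)
    return longest


rng = random.Random(7)  # fixed seed so the run is repeatable
streaks = []
for _ in range(1000):
    flips = [rng.choice("HT") for _ in range(100)]
    streaks.append(longest_streak(flips))

share_with_five = sum(1 for s in streaks if s >= 5) / len(streaks)
print(f"Average longest streak: {sum(streaks) / len(streaks):.1f}")
print(f"Share of runs with a streak of 5 or more: {share_with_five:.0%}")
```

Streaks of five or more show up in the large majority of runs, which is why a short sample taken from the middle of one can be so misleading.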

Fortunately, we have a phenomenon that helps us account for variance called “regression toward the mean”.

Regression toward the mean is “the phenomenon that if a variable is extreme on its first measurement, it will tend to be closer to the average on its second measurement.”

Combined with a related principle called the law of large numbers, this means that as we continue increasing the sample size and the length of observation, the mean of our observations will get closer and closer to the true mean of the population.

In other words, if we test a big enough sample for a sufficient length of time, we will get accurate “enough” results.
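Here is a rough sketch of that effect in Python, again assuming a 10% true conversion rate. It measures the observed rate at increasing sample sizes. The seed and the sample sizes are arbitrary assumptions.

```python
import random

TRUE_RATE = 0.10  # assumed true conversion rate of the population


def observed_rate(sample_size, rng):
    """Simulate `sample_size` visitors and return the observed conversion rate."""
    conversions = sum(1 for _ in range(sample_size) if rng.random() < TRUE_RATE)
    return conversions / sample_size


rng = random.Random(1)  # fixed seed so the run is repeatable
for n in (100, 1_000, 10_000, 100_000):
    rate = observed_rate(n, rng)
    print(f"n={n:>7}: observed {rate:.2%}, off by {abs(rate - TRUE_RATE):.2%}")
```

Any individual run can wobble, but the gap between the observed rate and the true rate tends to shrink as the sample grows.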

So what do I mean by accurate “enough”?

Understanding Confidence Intervals & Margin of Error

In order to compare two pages against each other in an AB test, we have to first collect data on each page individually.

Typically, whatever AB testing tool you are using will automatically handle this for you, but there are some important details that can affect how you interpret results, and this is the foundation of statistical hypothesis testing, so I want to go ahead and cover this part of the process.

Let’s say you test your original page with 3,662 visitors and get 378 conversions. What is the conversion rate?

You are probably tempted to say 10.3%, but that’s inaccurate. 10.3% is simply the mean of our sample. There’s a lot more to the story.

To understand the full story, we need to understand two key terms:

  1. Confidence Interval
  2. Margin of Error

You may have seen something like this before in your split testing dashboard.

The original page above has a conversion rate of 10.3% plus or minus 1.0%. The 10.3% conversion rate value is the mean. The ± 1.0% is the margin of error, and this gives us a confidence interval spanning from 9.3% to 11.3%.

10.3% ± 1.0 % at 95% confidence is our actual conversion rate for this page.

What we are saying here is that we are 95% confident that the true mean of this page is between 9.3% and 11.3%. From another angle, we are saying that if we were to repeat this test with 20 different samples and calculate an interval from each one, we would expect about 19 of those intervals to contain the true conversion rate.

The confidence interval is a range, calculated from our observed data, that is expected to contain the true conversion rate at our chosen confidence level. We manually select our desired confidence level at the beginning of our test, and the size of the sample we need is based on our desired confidence level.

The range of our confidence interval is then calculated using the mean and the margin of error.

The easiest way to demonstrate this is with a visual.

Confidence interval example | A/B Testing Statistics

The confidence level is decided upon ahead of time. The interval itself is calculated from what we observed, and the guarantee is about the method rather than any single result: if we repeated the test many times, we would expect about 19 out of every 20 intervals built this way to contain the page’s true conversion rate.

The upper bound of the confidence interval is found by adding the margin of error to the mean. The lower bound is found by subtracting the margin of error from the mean.

The margin of error is a function of the standard deviation, which is a function of the variance. Really all you need to know is that all of these terms are measures of variability across samples.
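If you are curious where the ± 1.0% comes from, here is one way to reproduce it in Python for the 378 conversions out of 3,662 visitors in our example. This sketch assumes the common normal approximation for a proportion. Your testing tool may use a different method, so treat it as an illustration rather than a description of any specific tool.

```python
import math
from statistics import NormalDist


def conversion_interval(conversions, visitors, confidence=0.95):
    """Return (rate, margin_of_error) using the normal approximation."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    standard_error = math.sqrt(rate * (1 - rate) / visitors)
    return rate, z * standard_error


rate, margin = conversion_interval(378, 3662)
print(f"{rate:.1%} ± {margin:.1%}")  # 10.3% ± 1.0%
print(f"Interval: {rate - margin:.1%} to {rate + margin:.1%}")  # 9.3% to 11.3%
```

Notice that the margin shrinks as the number of visitors grows, which is the same sample size effect we talked about earlier.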

Confidence levels are often confused with significance levels (which we’ll discuss in the next section) because the two are complements of each other: a 95% confidence level corresponds to a 5% significance level.

You can set the confidence level to be whatever you like. If you want 99% certainty, you can achieve it, BUT it will require a significantly larger sample size. As the chart below demonstrates, diminishing returns make 99% impractical for most marketers, and 95% or even 90% is often used instead for a cost-efficient level of accuracy.

In high-stakes scenarios (life-saving medicine, for example), testers will often use 99% confidence intervals, but for the purposes of the typical CRO specialist, 95% is almost always sufficient.
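To put rough numbers on that tradeoff, the sketch below estimates how many visitors you would need to pin down a conversion rate to within ± 1.0% at different confidence levels. It assumes a 10% expected conversion rate and the normal approximation. Both are assumptions for illustration, and the function name is my own.

```python
import math
from statistics import NormalDist


def visitors_needed(expected_rate, margin_of_error, confidence):
    """Sample size for a target margin of error (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * expected_rate * (1 - expected_rate) / margin_of_error ** 2)


for confidence in (0.90, 0.95, 0.99):
    n = visitors_needed(expected_rate=0.10, margin_of_error=0.01, confidence=confidence)
    print(f"{confidence:.0%} confidence: about {n:,} visitors")
```

Under these assumptions, moving from 95% to 99% takes well over one and a half times the traffic for the same margin of error.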

Advanced testing tools will use this process to measure the sample conversion rate for both the original page AND Variation B, so it’s not something you are really going to ever have to calculate on your own, but this is how our process starts, and as we’ll see in a bit, it can impact how we compare the performance of our pages.

Once we have our conversion rates for both the pages we are testing against each other, we use statistical hypothesis testing to compare these pages and determine whether the difference is statistically significant.

Important Note About Confidence Intervals

It’s important to understand the confidence levels your AB testing tools are using and to keep an eye on the confidence intervals of your pages’ conversion rates.

If the confidence intervals of your original page and Variation B overlap, you need to keep testing even if your testing tool is saying that one is a statistically significant winner.

Significance, Errors, & How To Achieve The Former While Avoiding The Latter

Remember, our goal here isn’t to identify the true conversion rate of our population. That’s impossible.

When running an AB test, we are making a hypothesis that Variation B will convert at a higher rate for our overall population than Variation A will. Instead of displaying both pages to all 100,000 visitors, we display them to a sample instead and observe what happens.

  • If Variation A (the original) had a better conversion rate with our sample of visitors, then no further actions need to be taken as Variation A is already our permanent page.
  • If Variation B had a better conversion rate, then we need to determine whether the improvement was statistically large “enough” for us to conclude that the change would be reflected in the larger population and thus warrant us changing our page to Variation B.

So why can’t we take the results at face value?

The answer is variability across samples. Thanks to the variance, there are a number of things that can happen when we run our AB test.

  1. Test says Variation B is better & Variation B is actually better
  2. Test says Variation B is better & Variation B is not actually better (type I error)
  3. Test says Variation B is not better & Variation B is actually better (type II error)
  4. Test says Variation B is not better & Variation B is not actually better

As you can see, there are two different types of errors that can occur. In examining how we avoid these errors, we will simultaneously be examining how we run a successful AB test.

Before we continue, I need to quickly explain a concept called the null hypothesis.

The null hypothesis is a baseline assumption that there is no relationship between two data sets. When a statistical hypothesis test is run, the results either disprove the null hypothesis or they fail to disprove the null hypothesis.

This concept is similar to “innocent until proven guilty”: A defendant’s innocence is legally supposed to be the underlying assumption unless proven otherwise.

For the purposes of our AB test, it means that we automatically assume Variation B is NOT a meaningful improvement over Variation A. That is our null hypothesis. Either we disprove it by showing that Variation B’s conversion rate is a statistically significant improvement over Variation A, or we fail to disprove it.

And speaking of statistical significance…

Type I Errors & Statistical Significance

A type I error occurs when we incorrectly reject the null hypothesis.

To put this in AB testing terms, a type I error would occur if we concluded that Variation B was “better” than Variation A when it actually was not.

Remember that by “better”, we aren’t talking about the sample. The point of testing our samples is to predict how a new page variation will perform with the overall population. Variation B may have a higher conversion rate than Variation A within our sample, but we don’t truly care about the sample results. We care about whether or not those results allow us to predict overall population behavior with a reasonable level of accuracy.

So let’s say that Variation B performs better in our sample. How do we know whether or not that improvement will translate to the overall population? How do we avoid making a type I error?

Statistical significance.

Statistical significance is attained when the p-value is less than the significance level. And that is way too many new words in one sentence, so let’s break down these terms real quick and then we’ll summarize the entire concept in plain English.

The p-value is the probability of obtaining results at least as extreme as the ones we observed, given that the null hypothesis is true.

In other words, the p-value tells us how surprising our results would be if there were really no difference between the pages. Imagine running an A/A test, where you displayed your page to 1,000 people and then displayed the exact same page to another 1,000 people.

You wouldn’t expect the sample conversion rates to be identical. We know there will be variability across samples. But you also wouldn’t expect one to be drastically higher or lower than the other. There is a range of variability that you would expect to see across samples, and the p-value measures how likely it is that a difference as large as the one we observed would show up from that variability alone.

The significance level is the probability of rejecting the null hypothesis given that it is true.

Essentially, the significance level is a value we set based on the level of accuracy we deem acceptable. The industry standard significance level is 5%, which means we accept a 5% risk of declaring a winner when there is no real difference.

So, to answer our original question:

We achieve statistical significance in our test when we can say with 95% certainty that the increase in Variation B’s conversion rate falls outside the expected range of sample variability.

Or from another way of looking at it, we are using statistical inference to determine that if there were truly no difference between the two pages, fewer than 1 in 20 tests would show Variation B beating Variation A by a margin this large.
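Here is a minimal sketch of how a p-value can be compared against the significance level in Python. It uses a pooled two-proportion z-test, which is one common approach and not necessarily what your testing tool runs. Variation A uses the 378 out of 3,662 figures from earlier. The numbers for Variation B are made up for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """One-sided p-value for 'B converts better than A' (pooled z-test)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    return 1 - NormalDist().cdf(z)


SIGNIFICANCE_LEVEL = 0.05  # the 5% industry standard mentioned above

# Variation B's 440 conversions are a hypothetical figure.
p_value = two_proportion_p_value(conv_a=378, n_a=3662, conv_b=440, n_b=3662)
print(f"p-value: {p_value:.4f}")
print("Significant" if p_value < SIGNIFICANCE_LEVEL else "Not significant")
```

I used a one-sided test here because our hypothesis is specifically that Variation B is better. Many tools report a two-sided p-value instead, which would be roughly double this figure.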

Type II Errors & Statistical Power

A type II error occurs when the null hypothesis is false, but we incorrectly fail to reject it.

To put this in AB testing terms, a type II error would occur if we concluded that Variation B was not “better” than Variation A when it actually was better.

Just as type I errors are related to statistical significance, type II errors are related to statistical power, which is the probability that a test correctly rejects the null hypothesis when it is false.

For our purposes as split testers, the main takeaway is that larger sample sizes over longer testing periods equal more accurate tests. Or as Ton Wesseling of Testing.Agency says here:

You want to test as long as possible – at least 1 purchase cycle – the more data, the higher the Statistical Power of your test! More traffic means you have a higher chance of recognizing your winner on the significance level your testing on!

Because…small changes can make a big impact, but big impacts don’t happen too often – most of the times, your variation is slightly better – so you need much data to be able to notice a significant winner.
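Ton’s point about small improvements needing a lot of data can be made concrete with a rough sample size estimate. The sketch below uses a standard approximation for comparing two proportions. The 10% base rate, the 5% significance level, and the 80% power target are conventional assumptions of mine, not figures from Ton.

```python
import math
from statistics import NormalDist


def visitors_per_variation(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect a lift (two-sided test)."""
    new_rate = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = base_rate * (1 - base_rate) + new_rate * (1 - new_rate)
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / (new_rate - base_rate) ** 2)


for lift in (0.05, 0.10, 0.20, 0.50):
    n = visitors_per_variation(base_rate=0.10, relative_lift=lift)
    print(f"{lift:.0%} relative lift: about {n:,} visitors per variation")
```

Halving the lift you want to detect roughly quadruples the traffic you need, which is why slightly better variations take so long to confirm.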

Statistical significance is typically the primary concern for AB testers, but it’s important to understand that tests will oscillate between being significant and not significant over the course of a test. This is why it’s important to have a sufficiently large sample size and to test over a set time period that accounts for the full spectrum of population variability.

For example, if you are testing a business that has noticeable changes in visitor behavior on the 1st and 15th of the month, you need to run your test for at least a full calendar month. This is your best defense against one of the most common mistakes in AB testing… getting seduced by the novelty effect.

Peter Borden explains the novelty effect in this post:

Sometimes there’s a “novelty effect” at work. Any change you make to your website will cause your existing user base to pay more attention. Changing that big call-to-action button on your site from green to orange will make returning visitors more likely to see it, if only because they had tuned it out previously. Any change helps to disrupt the banner blindness they’ve developed and should move the needle, if only temporarily.

More likely is that your results were false positives in the first place. This usually happens because someone runs a one-tailed test that ends up being overpowered. The testing tool eventually flags the results as passing their minimum significance level. A big green button appears: “Ding ding! We have a winner!” And the marketer turns the test off, never realizing that the promised uplift was a mirage.

By testing a large sample size that runs long enough to account for time-based variability, you can avoid falling victim to the novelty effect.

Important Note About Statistical Significance

It’s important to note that whether we are talking about the sample size or the length of time a test is run, the parameters for the test MUST be decided on in advance.

Statistical significance cannot be used as a stopping point or, as Evan Miller details, your results will be meaningless.

As Peter alludes to above, many AB testing tools will notify you when a test’s results become statistically significant. Ignore this. Your results will often oscillate between being statistically significant and not being statistically significant.

The only point at which you should evaluate significance is the endpoint that you predetermined for your test.
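To get a feel for why stopping at the first significant reading is dangerous, here is a simulation sketch. It runs a series of A/A tests, where both arms have the same true conversion rate, so every “winner” is a false positive. It then compares checking significance every 100 visitors against checking only at the planned endpoint. The traffic figures, check frequency, and seed are my own assumptions.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a, conv_b, n):
    """Two-sided pooled z-test p-value for two arms of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return 1.0  # no variation at all, so nothing to detect
    z = (conv_b - conv_a) / n / math.sqrt(pooled * (1 - pooled) * 2 / n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(simulations=400, visitors=1000, check_every=100, rate=0.10, seed=3):
    """Run A/A tests; compare 'stop at first significance' with 'wait until the end'."""
    rng = random.Random(seed)
    peeking_hits = end_hits = 0
    for _ in range(simulations):
        conv_a = conv_b = 0
        ever_significant = False
        for visitor in range(1, visitors + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if visitor % check_every == 0 and p_value(conv_a, conv_b, visitor) < 0.05:
                ever_significant = True
        peeking_hits += ever_significant
        end_hits += p_value(conv_a, conv_b, visitors) < 0.05
    return peeking_hits / simulations, end_hits / simulations


peeking, fixed = false_positive_rates()
print(f"Stopping at the first significant peek: {peeking:.1%} false positives")
print(f"Evaluating only at the planned endpoint: {fixed:.1%} false positives")
```

The more often you peek, the more chances random noise has to cross the line, so the peeking rate climbs well above the 5% you thought you signed up for.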

Terminology Cheat Sheet

We’ve covered quite a bit today.

For those of you who have just been smiling and nodding whenever statistics are brought up, I hope this guide has cleared up some of the more confusing concepts while providing you with a solid framework from which to pursue deeper understanding.

If you’re anything like me, reading through it once won’t be enough, so I’ve gone ahead and put together a terminology cheat sheet that you can grab. It lists concise definitions for all the statistics terms and concepts we covered in this article.

  • Download The Cheat Sheet

    A concise list of statistics terminology to take with you for easy reference.

Correlation and causation are two very different things. Often correlation is at work while the causation is not. By understanding how to identify them, we can master correlation, causation and the decisions they drive.

In 2008, Hurricane Ike stormed his way through the Gulf of Mexico, striking the coasts of Texas and Louisiana. This powerful Category 3 hurricane took 112 lives, making Ike the seventh most deadly hurricane in recent history.

Ike stands alone in one other way: It is the only storm with a masculine name in the list of ten most deadly storms since 1950. For all of his bravado, Ike killed fewer people than Sandy, Agnes, the double-team of Connie and Diane, Camille, Audrey and Katrina. Here are the top ten most deadly hurricanes according to a video published by the Washington Post.

If we pull the data for the top ten hurricanes since 1950 from

#10-Carol, 1954, 65 Deaths

#9-Betsy, 1965, 75 Deaths

#8-Hazel, 1954, 95 Deaths

#7-Ike, 2008, 112 Deaths

#6-Sandy, 2012, 117 Deaths

#5-Agnes, 1972, 122 Deaths

#4-Connie and Diane, 1955, 184 Deaths

#3-Camille, 1969, 265 Deaths

#2-Audrey, 1957, 416 Deaths

#1-Katrina, 2005, 1833 Deaths

There is a clear correlation in this data, and in data collected on 47 other hurricanes. Female-named hurricanes kill 45 people on average, while the guys average only 23.

Heav’n has no Rage, like Love to Hatred turn’d,

Nor Hell a Fury, like a Woman scorn’d. — William Congreve

Now, if we assume causation is at work as well, an answer to our problem presents itself quite clearly: We should stop giving hurricanes feminine names because it makes them meaner. Clearly, hurricanes are affected by the names we give them, and we can influence the weather with our naming conventions.

You may find this conclusion laughable, but what if I told you that secondary research proved the causation, that we can reduce deaths by as much as two thirds simply by changing Hurricane Eloise to Hurricane Charley? It appears that hurricanes are sexist, that they don’t like being named after girls, and get angry when we do so.

Our minds don’t really like coincidence, so we try to find patterns where maybe there isn’t one. Or we see a pattern, and we try to explain why it’s happening because once we explain it, it feels like we have a modicum of control. Not having control is scary.

As it turns out, The Washington Post published an article about the relationship between the gender of hurricanes’ names and the number of deaths the hurricane causes. The article’s title is “Female-named hurricanes kill more than male hurricanes because people don’t respect them, study finds.” The opening sentence clears up confusion you might get from the title: “People don’t take hurricanes as seriously if they have a feminine name and the consequences are deadly, finds a new groundbreaking study.”

The Difference Between Correlation and Causation

Another way to phrase the Washington Post’s conclusion is, The number of hurricane-related deaths depends on the gender of the hurricane’s name. This statement demonstrates a cause/effect relationship where one thing – the number of deaths – cannot change unless something else – the hurricane’s name – behaves a certain way (in this case, it becomes more or less feminine).

If we focus on decreasing hurricane-related deaths, we can make changes to the naming convention that will try to take people’s implicit sexism out of the picture. We could:

  • Make all the names either male or female instead of alternating
  • Choose names that are gender non-specific
  • Change the names to numbers
  • Use date of first discovery as identification
  • Use random letter combinations
  • Use plant or animal names

What is Correlation?

In order to calculate a correlation, we must compare two sets of data. We want to know if these two datasets correlate or change together. The graph below is an example of two datasets that correlate visually.

Graph from Google Analytics showing two datasets that appear to correlate.

In this graph of website traffic, our eyes tell us that the Blue and Orange data change at the same time and with the same magnitude from day to day. Incidentally, causation is at play here as well. The Desktop + Tablet Sessions data is part of All Sessions so the latter depends on the former.

How closely do these two lines correlate? We can find out with some help from a tool called a scatter plot. These are easy to generate in Excel. In a scatter plot, one dataset is plotted along the horizontal axis and the other is graphed along the vertical axis. In a typical graph, the vertical value, called y depends on the horizontal value, usually called x. In a scatter plot, the two are not necessarily dependent on each other. If two datasets are identical, then the scatter plot is a straight line. The following image shows the scatter plot of two datasets that correlate well.

The scatter plot of two datasets with high correlation.

In contrast, here is the scatter plot of two datasets that don’t correlate.

The scatter plot of two datasets with a low correlation.

The equations you see on these graphs include an R² value that is calculated by Excel for us when we add a Trendline to the graph. The closer this value is to 1, the higher the statistical correlation. You can see that the first graph has an R² of 0.946 (close to 1) while the second is 0.058. We will calculate a correlation coefficient and use a scatter plot graph to visually inspect for correlations.
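If you would rather not rely on Excel, the same figure can be worked out by hand. This Python sketch calculates the Pearson correlation coefficient for two lists and squares it, which for a straight-line trendline gives the same R² value. The session counts below are made-up numbers for illustration, not the data behind the graphs above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up daily session counts, for illustration only.
all_sessions = [1200, 1350, 1280, 1500, 1420, 900, 870]
desktop_tablet_sessions = [800, 910, 850, 1010, 950, 560, 540]

r = pearson_r(all_sessions, desktop_tablet_sessions)
print(f"r = {r:.3f}, R-squared = {r ** 2:.3f}")
```

Keep in mind that a high R² only tells you the two datasets move together. It says nothing yet about whether one causes the other.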

For data that shows a strong correlation, we can then look for evidence proving or disproving causation.

Errors in Correlation, Causation

A number of other effects can masquerade as causation:

  1. Coincidence: Sometimes random occurrences appear to have a causal relationship.
  2. Deductive Error: There is a causal relationship, but it’s not what you think.
  3. Codependence: An external influence, a third variable, on which the two correlated things depend.

Errors of codependence result from an external stimulus that affects both datasets equally. Here are some examples.

Math scores are higher when children have larger shoe sizes.

Can we assume larger feet cause increased capacity for math?

Possible third variable: Age; children’s feet get bigger when they get older.

Enclosed dog parks have a higher incidence of dogs biting other dogs/people.

Can we assume enclosed dog parks cause aggression in dogs?

Possible third variable: Attentiveness of owners; pet owners might pay less attention to their dogs’ behavior when there is a fence around the dog park.

Satisfaction rates with airlines steadily increase over time.

Can we assume that airlines steadily improve their customer service?

Possible third variable: Customer expectations; customers may have decreasing expectations of customer service over time.

The burden of proof is on us to prove causation and to eliminate these alternative explanations.

How to Prove Causation When All You Have is Correlation

As we have said, when two things correlate, it is easy to conclude that one causes the other. This can lead to errors in judgement. We need to determine if one thing depends on the other. If we can’t prove this with some confidence, it is safest to assume that causation doesn’t exist.

1. Evaluate the Statistics

Most of our myths, stereotypes and superstitions can be traced to small sample sizes. Our brains are wired to find patterns in data, and if given just a little data, our brains will find patterns that don’t exist.

The dataset of hurricanes used in the Washington Post article contains 47 datapoints. That’s a very small sample to be making distinctions about. It’s easier to statistically eliminate causation as an explanation than it is to prove it.

For example, people avoid swimming in shark-infested waters because they believe it is likely to cause death by shark. Yet they don’t avoid walking under coconut trees because, “What are the odds” that a coconut will kill you. As it turns out, there are 15 times more fatalities each year from falling coconuts than from shark attacks.

If you’re dealing with fewer than 150 data points (the coconut line), then you probably don’t need to even worry about whether one thing caused the other. In this case, you may not be able to prove correlation, let alone causation.

2. Find Another Dataset

In the case of hurricanes, we have two datasets: The number of deaths and whether or not the hurricane was named after a boy or a girl.

The relationship between a hurricane's name and hurricane deaths.

The correlation is pretty obvious. This is binary: either the storm has a man’s name or a woman’s name. However, this becomes a bit clouded when you consider names like Sandy and Carol, which are names for both men and women. We need a dataset that measures our second metric with more granularity if we’re going to calculate a correlation.

Fortunately, we have the web. I was able to find another dataset that rated names by masculinity. Using the ratings found on the site behindthename.com, we graphed femininity vs. death toll. Because of the outlier, Katrina, we used a logarithmic scale.

There is little statistical correlation between masculinity and death toll. Causation is in question.

I created a trend line for this data and asked Excel to provide a coefficient of determination, or an R-squared value. As you remember, the closer this number is to 1, the more strongly the two datasets correlate. At 0.0454, there’s not a lot of correlation here.
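For readers who would rather not rely on Excel, the sketch below shows what that trend-line calculation does. For a straight-line fit, R-squared is the square of the Pearson correlation between the two columns. The femininity scores and death tolls here are invented placeholders, not the behindthename.com ratings or the real hurricane data, and the function name is my own. Fitting against the log of the death toll is also my assumption, mirroring the logarithmic axis on the chart.

```python
import math


def r_squared(xs, ys):
    """Coefficient of determination for a straight-line fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical placeholder data: a femininity score per name and its death toll.
femininity = [1.5, 3.0, 4.2, 6.8, 8.1, 9.4, 10.0]
deaths = [21, 3, 256, 15, 62, 5, 1800]

# A log scale keeps one extreme storm from dominating the fit.
log_deaths = [math.log10(d) for d in deaths]

print(f"R-squared: {r_squared(femininity, log_deaths):.4f}")
```

Swap in your own two columns to check any correlation you are tempted to believe. A value near 0 means the trend line explains almost none of the variation.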

Researchers at the University of Illinois and Arizona State University did the same thing as part of their study, according to the Washington Post story. They found the opposite result. “The difference in death rates between genders was even more pronounced when comparing strongly masculine names versus strongly feminine ones.” They were clearly using a different measure of “masculinity” to reach their conclusion.

What else could we do to test causation?

3. Create Another Dataset Using AB Testing

Sometimes, we need to create a dataset that verifies causation. The researchers in our Washington Post study did this. They set up experiments “presenting a series of questions to between 100 and 346 people.” They found that the people in their experiments predicted that male-named hurricanes would be more intense, and that they would prepare less for female-named hurricanes.

In short, we are all sexist. And it’s killing us.

Running an experiment is a great way to generate more data about a correlation in order to establish causation. When we run an AB test, we are looking for causation, but will often settle for correlation. We want to know if one of the changes we make to a website causes an increase in sales or leads.

We can deduce causation by limiting the number of things we change to one per treatment.

AB Testing Example: Correlation or Causation

One of the things we like to test is the importance of findability on a website. We want to discern how important it is to help visitors find things on a site. For a single product site, findability is usually not important. If we add search features to the site, conversions or sales don’t rise.

For a catalog ecommerce site with hundreds or thousands of products, findability may be a huge deal. Or not.

We use a report found in Google Analytics that compares the conversion rate of people who search against all visitors.

This report shows that "users" who use the site search function on a site buy more often and make bigger purchases when they buy.

This data includes hundreds of data points over several months, so it is statistically sound. Is it OK, then, to assume that if we get more visitors to search, we’ll see an increase in purchases and revenue? Can we say that searching causes visitors to buy more, or is it that buyers use the site search feature more often?

In this case, we needed to collect more information. If search causes an increase in revenue, then if we make site search more prominent, we should see an increase in transactions and sales. We designed two AB tests to find out.

In one case, we simplified the search function of the site and made the site search field larger.

This AB Test helped identify causation by increasing searches and conversions.

Being the skeptical scientists that we are, we defined another AB test to help establish causation. We had a popover appear when a visitor was idle for more than a few seconds. The popover offered the search function.

This AB test increased the number of searchers and increased revenue per visit.

At this point, we had good evidence that site search caused more visitors to buy and to purchase more.

Another AB Testing Example

The point of AB testing is to make changes and be able to say with confidence that what you did caused conversion rates to change. The conversion rate may have plummeted or skyrocketed or something in between, but it changed because of something you did.

One of our clients had a sticky header sitewide with three calls-to-action: Schedule a Visit, Request Info, and Apply Now. Each of these three CTAs brought the visitor to the exact same landing page.

The choices shown on this page may have overwhelmed visitors.

We hypothesized that multiple choices were overwhelming visitors, and they were paralyzed by the number of options. We wanted to see if fewer options would lead to more form fills. To test this hypothesis, we only changed one thing for our AB test: we removed “Apply Now”.

After this change we saw a 36.82% increase in form fills. The conversion rate went from 4.9% to 6.71%.

Phrased differently: The number of form fills depends on the number of CTAs.

We get the terms Dependent Variable and Independent Variable from this kind of cause/effect relationship.

The number of CTAs is the independent variable because we – the people running the test – very intentionally changed it.

The number of form fills is the dependent variable because it depended on the number of CTAs. Changes to the dependent variable happen indirectly. A researcher can’t reach in and just change it.

Make sense?

This is called a causal relationship because one variable causes another to change.
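The lift in the sticky-header test is a relative change, and whether it is trustworthy depends on how many visitors saw each version. The sketch below computes the relative lift from the two published conversion rates, then runs a standard two-proportion z-test. The visitor counts are hypothetical, since the article does not report them, and the function names are my own. Because the published rates are rounded, the computed lift comes out near 36.9% rather than exactly 36.82%.

```python
import math


def relative_lift(control_rate, treatment_rate):
    """Relative change of the treatment rate over the control rate."""
    return (treatment_rate - control_rate) / control_rate


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p-value for a difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Conversion rates as published (rounded): 4.9% control, 6.71% treatment.
lift = relative_lift(0.049, 0.0671)
print(f"Relative lift: {lift:.1%}")

# Hypothetical traffic: 10,000 visitors per variation. Not from the article.
z, p = two_proportion_z(490, 10_000, 671, 10_000)
print(f"z = {z:.2f}, p = {p:.2g}")
```

With far fewer visitors per variation, the same two rates could easily fail this test, which is the first rule of behavioral data showing up again.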

4. Prove the “Why?”

If you have a set of data that seems to prove causation, you are left with the need to answer the question, “Why?”

Why do female-named hurricanes kill more people? The hypothesis we put forward at the beginning of this article was that girly names make hurricanes angry and more violent. There is plenty of evidence from the world of physics that easily debunks this theory. We chose it because it was absurd, and we hoped an absurdity would get you to read this far. (SUCCESS!)

The researchers written about by the Washington Post came up with a more reasonable explanation: that the residents in the path of such storms are sexist, and prepare less for feminine-sounding hurricanes. However, even this reasonable explanation needed further testing.

The problem with answering the question, “Why?” in a reasonable way is that our brains will decide that it is the answer just because it could be the answer. Walking at night causes the deaths of more pedestrians than walking in daylight. If I told you it was because more pedestrians drink at night and thus blunder into traffic, you might stop all analysis at that point. However, the real reason may be that drivers have more trouble seeing pedestrians at night than in the daytime.

Don’t believe the first story you hear or you’ll believe that hurricanes hold a grudge. Proving the “Why” eliminates errors of deduction.

Does Watching Video Cause More Conversions?

We ran an AB test for the site Automatic.com in which we replaced an animation with a video that explains the benefits of their adapter that connects your smartphone to your car. In this test, the treatment with the video generated significantly more revenue than the control.

Replacing the animation (left) with a video (right) on one site increased revenue per visit.

Our test results demonstrate a correlation between video on the home page and an increase in revenue per visitor. It is a natural step to assume that the video caused more visitors to buy. Based on this, we might decide to test different kinds of video, different lengths, different scripts, etc.

As we now know, correlation is not causation. What additional data could we find to verify causation before we invest in additional video tests?

We were able to find an additional dataset. The video player provided by Wistia tracked the number of people who saw the video on the page vs. the number of people who watched the video. What we learned was that only 9% of visitors actually clicked play to start watching the video.

Even though conversions rose, there were few plays of the video.

So, the video content was only impacting a small number of visitors. Even if every one of these watchers bought, it wouldn’t account for the increase in revenue. Here, the 9% play rate is the number of unique plays divided by the number of unique page loads.

A more likely scenario is that the animation had a negative impact on conversion vs. the static video title card image. Alternatively, the load time of the animation may have allowed visitors to scroll past before seeing it.

Nonetheless, had we continued with our deduction error, we might have invested heavily in video production to find more revenue when changing the title card for this video was all we needed.

Back to Hurricanes

The article argues: The number of hurricane-related deaths depends on the gender of the hurricane’s name.

Do you see any holes in this conclusion?

These researchers absolutely have data that say in no uncertain terms that hurricanes with female names have killed more people, but have they looked closely enough to claim that the name causes death? Let’s think about what circumstances would introduce a third variable each time a hurricane makes landfall.

  • Month of year (is it the beginning or end of hurricane season?)
  • Position in lunar phase (was there a full moon?)
  • Location of landfall

If we only consider location of landfall, there are several other third variables to consider:

  • Amount of training for emergency personnel
  • Quality of evacuation procedures
  • Average level of education for locals
  • Average socio-economic status of locals
  • Proximity to safe refuge
  • Weather patterns during non-hurricane seasons

I would argue that researchers have a lot more work to do if they really want to prove that femininity of a hurricane’s name causes a bigger death toll. They would need to make sure that the only variable they are changing is the name, not any of these third variables.

Unfortunately for environmental scientists and meteorologists, it’s really difficult to isolate variables for natural disasters because it’s not an experiment you can run in a lab. You will never be able to create a hurricane and repeatedly unleash it on a town in order to see how many people run. It’s not feasible (nor ethical).

Fortunately for you, it’s a lot easier when you’re AB testing your website.