It Looks Like a Valid Split Test, but You May Be Missing Something.

Split tests are a favored tool of economists, direct response mailers and website optimizers. The primary challenge when creating a split test is being sure that you’re measuring exactly what you think you’re measuring.

As my grandmother used to say, “The road to bad conclusions is paved with uncontrolled variables.” Yes, we are a geeky family.

When your variables get out of hand, all hell breaks loose, and you end up testing something you didn’t intend to test.

When comparing humans, nothing beats twins for experimenting. I feel sorry for twins because of the number of medical and economic studies that tap them for “controlled” experiments. It must be like living on the Island of Dr. Moreau.

The following YouTube video uses twins to test the social axiom that “Chewing gum isn’t attractive.” Watch it and see if you can name some of the uncontrolled variables that may be putting this experiment on the statistical road to invalidity.

This is an art exhibit, not a controlled experiment. However, many who watch it may believe that it provides scientific evidence that we should all go out and pick up a few sticks of Wrigley’s if we want to be more charming. In fact, that is what the makers of Beldent want you to believe.

Is this test really telling us to put our jaws in motion if we’re to be more likable?

Controlling Variables

The artist who created this experimart controlled for appearance quite effectively. Genetically identical twins are used as the control and the variation. They are dressed identically. They wear identical makeup. They are sitting in a neutral position on identical chairs. They have identical facial expressions. The lighting is the same on both participants.

The only apparent difference is that one of them is chewing delicious Beldent gum (Beldent is the South American brand from Trident).

Test subjects sit in front of these two doppelgangers and listen to questions piped in through a headset.

“Which one seems like he has more friends?”

“Which one has more imaginary friends?”

“Which one gets invited to more parties?”

“Which one gets invited to more bridge tournaments?”

“Which of these bosses would give you a raise?”

Participants chose the left or the right by pressing one of two buttons vaguely reminiscent of The Family Feud.

It looks like all variables have been controlled, and that the results can be considered valid.

The Unsuspected Twists that Ruin Experiments

One thing experienced testers learn quickly is that little things will influence your tests more than you would have imagined.

For instance, this experiment could be testing if people favor pressing the left button when they are unsure of their answer. In every situation, the gum chewer is sitting on the left. To control for this, the gum chewer should have been on the left sometimes and on the right sometimes.
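
Here is a minimal sketch of how the seating could be balanced, assuming each participant is assigned a side before they sit down. The function name `assign_gum_sides` and the `seed` parameter are made up for illustration; nothing like this appears in the Beldent exhibit.

```python
import random


def assign_gum_sides(num_participants, seed=None):
    """Return a shuffled list of 'left'/'right' seats for the gum chewer.

    Hypothetical helper: half the participants see the chewer on the left,
    half on the right, in random order.
    """
    rng = random.Random(seed)  # seeded so a schedule can be reproduced
    half = num_participants // 2
    sides = ["left"] * half + ["right"] * half
    if num_participants % 2:
        # Odd head count: the leftover participant gets a coin flip.
        sides.append(rng.choice(["left", "right"]))
    rng.shuffle(sides)
    return sides


if __name__ == "__main__":
    schedule = assign_gum_sides(10, seed=42)
    print(schedule)
```

With the sides balanced and the order shuffled, a participant who reflexively favors one button no longer hands the gum chewer a free win.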

This experiment could be comparing gum chewers to people who look like total bummers. In every situation, the control twin is sitting with a neutral expression. In any situation, they would look boring. To control for this, they could have asked the twins to be smiling sometimes.

The handlers who set up the test, a test paid for by Beldent, may have unconsciously chosen the more attractive twin of each pair to be the gum chewer. Certainly, the artists wanted Beldent to win. This creates a bias. To control for this, the twins should have taken turns as the gum chewer.

The designers of the test attempt to control for age and gender biases. A mix of twins is used: men, women, older, younger.

The people pressing the buttons may represent a skewed sample set, however. The experiment was done in a museum. Only a portion of the total population enjoys museums. Thus, the test subjects are probably not representative of the population as a whole.

Basically, Beldent can only conclude that chewing gum makes you more attractive to museum-goers when you’re sitting to the left of your boring twin.

A More Rigorous Experiment

To really get a handle on the social benefits of constant mastication, you would design a test that controlled for even more variables.

Participants should be shown only one twin. This, of course, means you don’t need twins. You need just one person from each set, shown chewing and not chewing, smiling and not smiling.

Questions would have to be reworded.

“Does this person make friends easily?”

“Does this person get invited to parties frequently?”

I would scratch the “bridge tournament” question. You’ve got to be smart to play bridge well. Smart people are likable, right?

This new design would require a larger sample size than the 481 participants in the Beldent artsperiment. You now have sixteen different treatments (four chewing, four not chewing, four chewing and smiling, four not chewing and smiling). You would need 1600 participants or more to get close to statistical significance.
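
The arithmetic behind those numbers can be sketched in a few lines. The person labels are placeholders, and the figure of 100 participants per treatment is an assumption inferred from dividing 1600 by sixteen; it is not a formal power calculation.

```python
from itertools import product

# Placeholder labels for one person from each of four twin sets (assumed).
PEOPLE = ["person_1", "person_2", "person_3", "person_4"]


def build_treatments(people):
    """Cross every person with chewing/not chewing and smiling/not smiling."""
    return [
        {"person": person, "chewing": chewing, "smiling": smiling}
        for person, chewing, smiling in product(people, [True, False], [True, False])
    ]


def participants_needed(people, per_treatment=100):
    """Total sample if each treatment is shown to `per_treatment` people.

    per_treatment=100 is an assumption (1600 / 16), not a computed minimum.
    """
    return len(build_treatments(people)) * per_treatment


if __name__ == "__main__":
    print(len(build_treatments(PEOPLE)), participants_needed(PEOPLE))
```

Every variable you add multiplies the number of treatments, which is why a rigorous design gets expensive in participants so quickly.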

Unfortunately, this new experiment wouldn’t deliver such dramatic video footage. But without this rigor, Beldent may be lying to themselves — and to us — about the value of chewing gum.

Imagine Your Web Pages as Twins

Our job at Conversion Sciences is to design website tests for companies, tests that tell us exactly what we want to know about a web page and nothing more. We agonize over the subtle things that introduce bias into our test. We always want to test the right thing.

We create two or more versions of a page that are like identical twins, with only one thing changed. We make sure that the visitors in our tests are representative of the site’s visitors at large.

And we must control our natural human biases. Like the artists who set up the experiment, we want to get wins for our clients. If we give in to this desire, we can find ourselves calling tests too soon, or running with statistically insignificant results that favor our treatment. Even when our testing tools tell us we’ve got a winner, we are skeptical.
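
One way to put numbers behind that skepticism is a two-proportion z-test on the conversion counts. The sketch below is a generic textbook version using only the standard library. It is not a description of how any particular testing tool works, and the function name and the sample counts are invented for illustration.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference between two conversion rates.

    conv_a, conv_b: conversions on the control and the treatment.
    n_a, n_b: visitors who saw each version.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF written with the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)


if __name__ == "__main__":
    # Invented counts: 50 of 1000 convert on the control, 55 of 1000 on the treatment.
    print(two_proportion_p_value(50, 1000, 55, 1000))
```

A lift that looks exciting on a dashboard can still produce a p-value nowhere near the usual 0.05 threshold. Checking the p-value again and again as data trickles in also inflates false positives, which is one more reason not to call a test too soon.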

It isn’t hard to tell if we make a bad call. If our tests don’t increase the fortunes of our clients, our reputation isn’t worth a p-value. The accounting department’s numbers don’t lie.

Let us test some pages for you. If you have five hundred transactions a month or more, you have the sample size to test your way to more sales, more leads and more subscribers. And we’ll throw in some chewing gum.

