Your Heading Here
You need to run experiments. The one who runs the most experiments wins.
BUT – most marketing experiments are done wrong. What’s missing is hypothesis driven testing across all inter-business disciplines.
Nearly everything we use in our daily lives came about as a direct result of an experiment. The cars we drive, the computer or mobile device you are using to read this sentence, the Great Wall of China, Abercrombie and Fitch advertisements – all of them have gone through some form of the scientific method applied many times over a period of months, years, or even decades.
“You want experimentation. Every once in awhile, you stumble upon something that blows your mind.” – Jeremy Stoppelman, CEO of Yelp
Human beings use informal experiments all the time as a way to test-drive decisions that may have long reaching consequences. In order to decide whether we will take a certain action, we first construct a hypothesis, run an experiment, analyze the results, and then make a decision based on the available data. Most of the time, this process happens unconsciously and in only a few seconds.
Relatively recently, conversion optimization specialists have discovered the immense power of experimentation when applied to an online environment in the form of A/B Testing, Usability Testing, and more. In this course we will review the power of experimentation in the digital landscape while also examining wider reaching methods companies can use to harness it’s power.
By the end of this essay you will understand:
- How to create a structurally sound hypothesis
- Know the difference between leading and lagging metrics
- Have a better grasp on how to apply testing to web, email, and usability platforms
- Identify the key indicators of a well-designed offline experiment
Why do we test?
Consider the case of Orville and Wilbur Wright, the famous American duo responsible for inventing the world’s first airplane. As you might be not surprised to learn, the Wright brothers didn’t launch a fully-built flying machine on their first attempt. They initially started with kites. The kite the brothers built in 1899 was only 5 feet wide. It was texted extensively over a period of months, constantly being refined, augmented, and tested again.
In 1901 the brothers created their own version of a wind tunnel where they experimented on over 200 different types of wings and gliders, the winning versions of which would eventually go on to form the basis for their historic powered flight in 1902.
So what’s the point? As brilliant as the Wright brothers were they understood the dangers of failure. They knew that if something went wrong during a manned flight one or both of them might not be around to make a second attempt.
Experimentation is essentially risk management. Would you rather move forward with your multi-million-dollar campaign strategy with the foundation for success resting on nothing but a hope and a prayer? Or would you feel far more confident in a massive business initiative if you’d already found enormous success within a test sample?
Experimentation as a business tool has taken root in many modern organizations as a necessity in a competition fueled workforce. Facebook has formed their own research unit (Facebook research) and Microsoft’s ExP Platform is dedicated to the continued launch, analysis, and theoretical exploration of online experiments.
The act of testing, refining, and applying is not just an essential part of business, but of life itself, rooted in the foundations of human beings as an unconscious necessity for survival.
What makes a good experiment?
Simply put, an experiment is a test of a hypothesis. A hypothesis (if you skipped remedial science) is a proposed explanation for a phenomenon based on limited data.
A good hypothesis must be both testable and falsifiable. For example, if I present the hypothesis that I am an occult mystic because I can see ghosts with my third eye, there is no way for others to test or disprove my assertion (unless you happen to know Dr. Strange).
Better examples of a falsifiable hypothesis might be:
- I believe my shoe business is losing money because our sneakers are priced too high and by reducing the cost we will increase sales by at least 50%
- I believe that Bruce Willis has starred in higher rated movies than Nicholas Cage because he has more experience acting in high budget films
- I believe that by changing the banner on our website to something more representative of our core customer demographic, visitors will convert more frequently
All hypotheses start with some form of “I believe.” We don’t know for a fact if our hypothesis is correct, otherwise why would we be testing on the first place?
Next comes an explanation of the change we will make to the control set (or our default data) and a statement of what we think will happen if our test goes the way we predict.
Despite what it may seem, we can never prove a hypothesis to be “true.” We must always leave some room in our mind for doubt and suspicion. By stating that the change we observe from a test is 100% true, we are in danger of falling for the problem of induction, or assuming that an empirical observation is a fact.
For example, if my hypothesis is “All swans are white,” it would nearly impossible to prove this hypothesis correct (I would need to find every single swan in the world to make that claim) but all it would take is a single black swan to completely disprove my idea.
For that reason, all hypotheses have an opposite called the null hypothesis. If our hypothesis states that we believe there is some difference between our control and treatment, the null hypothesis states its opposite: There is NO difference.
Hypothesis testing is a bit like a trial. Instead of “innocent until proven guilty,” we assume “Null until proven otherwise.” We must demonstrate that the evidence against the null result is so huge we’d be very surprised if it turned out there really was no difference at all.
Check out some examples of null hypotheses below.
- By reducing the cost of our sneakers, sales will not increase by at least 50%
- Bruce Willis has not starred in higher rated movies than Nicholas Cage
- If we change our website banner, visitors will not convert more frequently
Once we have a hypothesis (Represented as HA) and a null hypothesis (Represented as H0) we can begin considering what type of test will run, for how long, and with how many samples.
Take some time to think about interesting test ideas you might have and how you would structure those tests in the form of a hypothesis. For every hypothesis you create, write down what the corresponding null hypothesis would be.
Lagging vs. Leading Indicators
One of the primary goals of experimentation is to understand the causal relationships between leading and lagging indicators.
A lagging indicator is an outcome. The data point. The numbers you typically show to the big boss. Revenue, downloads, and sign-ups are all examples of lagging indicators. Lagging indicators are often very difficult to change directly, especially if you have an established product or service. While it’s always necessary to keep track of lagging indicators to evaluate ongoing performance, as marketers we want to understand what causes those metrics to move either up or down. That’s where leading indicators come in.
Leading indicators are the “influencers” of lagging indicators. They predict how lagging indicators might be affected over time. The trouble is they are oftentimes very hard to pinpoint and equally hard to measure.
For example, a leading indicator of customer lifetime value might be user satisfaction. How is that measured? Is it something that can be calculated accurately? Are satisfaction scales different from one visitor to another?
Although it is difficult to measure customer satisfaction it is also much easier to influence. Think back to the last time you had a very bad experience in a store or online. How quickly did your level of satisfaction change? Oftentimes a single delightful or terrible experience is enough to alter a visitor’s perception of your brand forever.
Take a look below at some the leading indicators of major brands.
As you can see, leading metrics differ wildly depending on the brand and industry. Companies with large amounts of data (and even those without) regularly engage in statistical exercises like regression analysis in order to discover potential correlations between a leading and lagging indicator.
Do you know what lagging indicators represent your business KPI’s? What leading indicators are the clearest predictors of these success metrics?
Understanding what type of metrics are most important at driving growth helps us plan better more efficient experiments.
Before moving on to the next section, list down a few lagging indicators for your business. Common examples are Revenue and Conversions. For each lagging indicator, jot down several leading indicators you believe are strongly correlated to those metrics.
Which do you feel would be easier to test and why?
A/B Testing is the most well-known type of hypothesis testing in the mainstream marketing world. Popularized in the mid-late 2000’s by the rise of DIY website testing tools like Optimizely, Monetate, and Adobe Target, A/B Testing turned a previously painful and statistically rigorous mathematical process into a comparatively easy method of improving conversion rates.
The modus operandi of A/B Testing is straightforward: Randomize all web visitors into two or more groups, then show each group a different set of content. These content changes can be as small as a copy alteration or changing the entire website itself. When the test has enough visitors in each variation, analyze which version had the better performance by measuring key metrics during the test over time. Apply the winning version and badabing-badaboom, you’ve just made money! (Or at least, that’s the idea).
We don’t have enough space or time to do a full review of A/B Testing here but CXL has mountains of amazing resources on A/B Testing that should satiate even the most data minded marketer out there.
The important thing to remember about A/B Testing tools like Optimizely, VWO, or Google Experiments is that they are only automating a standardized testing process, the same standardized process that’s been used for nearly 100 years by scientists and researchers all over the world. Before moving on, let’s examine the steps of that methodology and how it takes shape.
- Randomization: Randomization is also very important. By randomizing the visitors who are initiated into each test cell, A/B Testing normalizes for outlier effects or other unusual events of chance. Without randomization you could open yourself up to potentially test-ruining sampling errors.
- Statistical Analysis: No hypothesis test could be complete without some form of statistical analysis. This analysis tells you how surprising the difference between the means of each variant might be and informs our decision to accept or reject the null hypothesis (Remember those terms from section 1?).
The beauty of experimentation is that the vast majority other types of tests follow the exact same rules. Therefore it’s important to understand not just WHAT your testing program is doing, but WHY it’s doing it. Once that’s clear, it’s easy to apply those key principles to many other areas of the business.
What are some other ways to use A/B Testing?
A/B Testing isn’t limited to standard web changes. Many tools are now beginning to offer such a solution on their platform. Let’s review a few of the more popular uses cases of A/B Testing and how you can leverage them.
A great use of A/B Testing is subject line and email content testing. Email tests can be deployed very quickly to a wide audience. Oftentimes these tests can return powerful results in short amounts of time. Many email delivery solutions like MailChimp, Bronto, and Campaign Monitor all have A/B functionality built into their programs.
Warning: You must be very careful before using an email delivery service to launch and evaluate A/B Tests. These tests often do not use any form of statistical analysis, meaning that declaring a winner is merely the result of pure mathematical observation! This is obviously unscientific and might be setting the foundation for a test program built on statistical phantoms.
If you’re not already doing usability testing, you should. Usability testing is the process of measuring how real users interact with the interfaces and designs of your online or offline properties. While your website or mobile app may look very nice from the outside, if a visitor can’t accomplish key tasks or runs into infuriating errors they will inevitably have a poor user experience. Do not underestimate the wrath of a really pissed off user!
There are many different methods of testing usability. We can run tests on a solitary web property or prototype in order to discover baseline metrics to compare against for future testing (this process is called benchmarking). We could also run comparative tests of one design vs. a modified design, or even against a competitor to understand how intuitive our program is when compared against other industry leaders.
Pretty powerful stuff right? The downside of usability testing is that we inevitably have much smaller sample sizes than what we typically see in standard web-based A/B Testing. But don’t fret, there are plenty of statistical methods we can do to account for tests with low sample sizes.
Pro-tip: If you’re making any type of comparison that implies inference (aka: “I believe our users will continue to behave in the manner we’ve observed”) then you must engage in some type of statistical analysis.
When you conduct a survey, what is it that you’re really doing? Unless you plan to receive a response from every single person in your audience population (which is impossible if your audience is still growing) you are making a statistical inference based on a sample. That means you should be running tests on survey data too.
Want a memorable rule of thumb? “If you want to compare, without statistics beware.” This advice doesn’t just apply to the digital world. Anything that examines the relationships between two or more sets of independent data should optimally be run through a statistical test in order to determine if the observed results are due to noise or a meaningful difference.
Some things to consider before moving on to the next section: How does your organization currently handle comparative data? What does your company test and what do they not? Can you see any opportunities for testing that may have gone unexplored until now?
Experimentation outside the digital framework
While the majority of optimization specialists focus on digital testing, the principles of experimentation we have discussed here can be applied to any part of the business, either internal or external. For example:
- Testing whether opening customer service lines one hour later negatively impacts customer perceptions or complaints
- Comparing checkout times in stores with automated vs. non-automated checkout
- Measuring whether using a dimmer, but more cost effective light-bulb has an impact on in-store foot traffic
- Testing whether major brand alterations (such as a name change) impact the number of localized website visitors
Some business models use experimentation to great effect already. QSR’s (Quick Service restaurants) such as Subway are prolific testers. With huge amounts of test subjects (franchises) an innovative R&D department, and a quick activation period, Subway has shown it has all the necessary ingredients to build a fantastic and effective testing program.
Other large chains like supermarkets, clothing outlets, and home décor companies have the ability to test extensively on product lines, pricing, visitor flow, and many other creative and wider reaching solutions.
Many tests can be done with the data that already exists in the business today. For example, in 1996 a relatively large supermarket chain “Dominick’s Finer Foods” discovered through experimentation that framing discounts in a bundle (“2 for $1”) instead of as single unit discounts (“1 for 50 cents”) significantly increased the number of units purchased. The same chain also discovered that imposing a coupon limit of 12 dramatically increased total unit volume over the standard coupon limit of 4.
In another example, a retailer based in the Midwest ran an experiment on musical “zoning,” or playing certain songs in different departments that most matched typical consumer demographic preferences. They found that altering the music by department had a significant effect on the amount and value of a customer’s purchase. Both of the above examples above spent minimal additional resources, yet the effects of these tests had a profound a profound impact on the bottom line.
Best Practices of Offline Testing
While choosing ideas for an A/B Test is a relatively risk free exercise when done correctly (online tests can be stopped immediately, QA’d extensively, and so on) offline experimentation is not quite as simple. Each offline test requires some level of buy-in and in some cases that investment is a sizable financial figure.
Choosing what to test is not a simple process and must be undertaken after careful deliberation. There are three general guidelines to follow before proposing an offline experiment to leadership.
- Is it wanted?
- Is it feasible?
- Would it be profitable?
Is it wanted?: The first thing we should always ask before launching a new experimental business initiative is: Does the customer want this?
If the customer isn’t interested in what you’re offering then it doesn’t matter whether or not your testing program has the budget to roll out a test to 500 stores nationwide, it’s going to be a waste of money.
There are many methods that help in understanding whether a product or service is wanted or not, but for now we will just focus on two. The first is easy- Talk to your customers. Ask them what changes they would like to see or whether an additional feature would help their buying experience.
When speaking directly to customers about their preferences, it’s critical to be cautious of leading questions. In 2009 Walmart ran a survey asking if people preferred cleaner aisles. When the vast majority of survey participants said yes, they thought they had a winner.
Several months and millions of dollars invested in trimming back product lines later, Walmart discovered in-store sales had plummeted. It turns out one of the defining aspects Walmart was that store-goers could find almost anything on the shelves. Losing the “clutter” actually damaged their inherent value proposition. It was a 1.85 billion dollar mistake.
In another example, Southwest Airlines was known for refusing to add inter-line baggage, reserved seating, and food service (though they’ve since caved on some of these). The reason wasn’t because the airline was inattentive or didn’t care about its customers, but because the airline’s differentiating factor was its low price-point and on-time service. Although frequent fliers may have wanted a slew of additional features, they did not want these to come at the cost of their existing perks.
The second method of uncovering customer feedback is through pilot studies. Prior to rolling out an experiment to 10, 50, or even 100 stores, measure the responses and feedback from one or two stores. Will these results be statistically significant? Well…possibly. Will they be representative of your customer population? Probably not.
However, measuring the customer experience in the wild is far better than solely focusing on what marketers believe people want. Conducting qualitative research is a great way to take the pulse on your visitors and hear from their own mouths whether they think your idea is good or bad.
Is it feasible? It’s true that experimentation should not be confined to rigid business goals. However, it’s important to consider the budget, resources, and potential metrics that might be affected negatively by a failed test in advance.
The first consideration in feasibility is practicality. How will you accomplish this test? What sort of resources and manpower would you need to execute it on the ground?
If you are testing different opening hours for a physical location, for example, you’d need to think of how employees will be compensated for the lost time. Will it be a required change or opt-in only? Will managers be paid a stipend for participating in a test that might damage their bottom line? Will local marketing campaigns be needed to make users aware of the change, and if so how, and how much will such an advertising effort cost?
Would it be profitable? Some might say it’s a jaded way of thinking, but the end goal of any experiment is to make the company money.
Think carefully about whether or not the test concept is likely to result in an impact on the business’ bottom line. While it’s not a strictly bad thing if a test produces no results, it’s also not optimal. Each test is an investment and your goal is to make sure those investments pay off as frequently as possible.
Imagine you are adjusting a product’s price point from $5.99 to $4.99. After conducting the test, the revenue gained from the additional foot traffic to the store was nullified by increasing margins. Even though the test was a failure you learned something important: people DO respond well to cheaper prices, but not nearly as well as you might have initially believed!