Marketing experts argue that A/B testing allows marketers to find the best-performing content, designs and product offers. Marketers also run A/B tests on landing pages, copy, email subject lines, prices and a range of other types of content.
Yet many marketers get bushwhacked by erroneous results from their A/B tests. New research from the Wharton School at the University of Pennsylvania found that 57 percent of marketers are guilty of “p-hacking”: they analyze data improperly and reach incorrect conclusions. The extent of the improper analysis was surprising, researchers told Knowledge@Wharton.
The paper, p-Hacking and False Discovery in A/B Testing, examined more than 2,100 experiments from nearly 1,000 accounts on the online platform Optimizely. The platform has since installed safeguards to prevent such errors.
Researchers compare p-hacking to “peeking.” Marketers check an experiment before it’s completed and stop the test when they see the desired results. The problem is that interim results may differ from the results of the completed test. Many marketers stop a test the first time it reaches the 90 percent confidence threshold; had they waited, the result might have fallen back below that threshold.
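The damage done by peeking is easy to demonstrate with a quick simulation. The sketch below is illustrative only and not part of the Wharton study: it runs many A/A tests, in which both versions are identical, and stops each one the first time an interim check crosses the 90 percent confidence mark. The visitor counts, conversion rate and peeking schedule are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def run_aa_test(n_visitors=10_000, conv_rate=0.05, peek_every=500, alpha=0.10):
    """Simulate one A/A test: both variants share the same true conversion rate.
    Return True if any interim peek (or the final look) declares significance."""
    a = rng.random(n_visitors) < conv_rate
    b = rng.random(n_visitors) < conv_rate
    for n in range(peek_every, n_visitors + 1, peek_every):
        # Two-proportion z-test at each peek
        p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        if se == 0:
            continue
        z = (a[:n].mean() - b[:n].mean()) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        if p_value < alpha:   # "90 percent confidence" reached -- stop early
            return True
    return False

trials = 2_000
false_positives = sum(run_aa_test() for _ in range(trials))
print(f"Declared a winner in {false_positives / trials:.0%} of A/A tests "
      f"despite no real difference (nominal error rate should be ~10%).")
```

With roughly 20 interim looks per test, the share of “winning” A/A tests climbs well above the nominal 10 percent error rate, which is exactly the kind of false discovery the researchers describe.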
Marketing Professor Calls P-Hacking Cheating
“Experimenters cheat themselves, their bosses or their clients,” said Wharton marketing professor Christophe Van den Bulte.
P-hacking can show a difference in performance between two versions when no significant difference exists. Worse, the practice can lead to costly decisions. If a flawed A/B test shows that two-day free shipping will produce more sales than 10-day free shipping, a company may commit to a costly strategy.
Companies may also stop testing too soon and stop seeking better alternatives. A company that mistakenly believes version A is the best might not bother to test another version.
Marketers may commit the data analysis mistake because they lack a background in statistics, the researchers theorize. They may also feel pressure to show managers or clients that one version is clearly better. Marketers in the media industry were more likely to p-hack, while those in the tech sector were not. Advertising agencies stand to gain, at least in the short term, if they assert that an idea improves performance.
That doesn’t mean they knowingly fudge results, but researchers suggest extra caution when viewing tests run by third parties.
Possible Solutions
Researchers suggest several possible solutions. Testing platforms can make statistically significant results harder to achieve or build in safeguards against p-hacking. Companies can promote a culture of accurate, ethical testing. Marketers can run follow-up tests with small control groups to confirm that their findings were not just a fluke of a single A/B test.
Previous research revealed that most A/B tests fail to produce statistically significant results. Fewer than 20 percent of 3,900 marketers surveyed by UserTester reported that their A/B tests produce significant results 80 percent of the time, according to eMarketer. A main issue is that marketers often test insignificant changes that customers don’t care about or barely notice.
Experts offer these recommendations to design A/B tests that provide meaningful results:
Create boldly different choices. Small differences can produce meaningful results for major companies with enormous traffic, but for most businesses the tweaks cause no noticeable improvement.
Be patient. Realize that obtaining results may require a few thousand website visits or two weeks; the rough sample-size sketch after this list shows why. Test substantial changes to avoid spending time waiting for small improvements.
Remain persistent. Accept inconclusive results as part of marketing analysis. Some managers may consider a test that shows inconclusive results a failure. It’s not. It shows that what was tested has little influence. That’s valuable information.
Test for sensitivity. Test a specific hypothesis and identify which elements impact results and which elements don’t “move the needle” and change consumer choices, explains Claire Vo, vice president of product management at Optimizely. Stay disciplined and keep an organized list of what impacts results.
Consider segmenting data. Examine test results across segments like devices, traffic sources and other factors, suggests Brian Massey at Conversion Sciences. Keep in mind that segments need to have sufficient sample size to produce conclusive results. Beware of implementing changes for segments that don’t drive significant revenue or leads.
Keep the original version. If tests don’t reveal a definite winner, keep the original version (the control). That’s simpler and conserves resources.
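To see why patience and bold changes go together, here is a rough sample-size sketch. It uses a standard two-proportion power calculation; the 5 percent baseline conversion rate, the lift sizes, the 90 percent confidence level and 80 percent power are illustrative assumptions, not figures from the article or the Wharton study.

```python
from math import ceil
from scipy.stats import norm

def visitors_per_variant(baseline, lift, alpha=0.10, power=0.80):
    """Approximate visitors needed per variant for a two-proportion test.

    baseline: control conversion rate (e.g. 0.05 for 5%)
    lift:     relative improvement to detect (e.g. 0.20 for +20%)
    """
    p1 = baseline
    p2 = baseline * (1 + lift)
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test at 90% confidence
    z_beta = norm.ppf(power)
    n = ((z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
         / (p2 - p1) ** 2)
    return ceil(n)

# A bold change (a hoped-for 30% lift) needs far fewer visitors than a small tweak.
for lift in (0.05, 0.10, 0.30):
    print(f"Detecting a {lift:.0%} lift on a 5% baseline: "
          f"~{visitors_per_variant(0.05, lift):,} visitors per variant")
```

Under these assumptions, a bold 30 percent lift becomes detectable with a few thousand visitors per variant, while a 5 percent tweak needs closer to a hundred thousand. That gap is why small changes rarely pay off for businesses without enormous traffic.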
Bottom Line: Marketers frequently reach incorrect conclusions in A/B testing, perhaps out of ignorance or an urge to find the version that outperforms the other. Either way, faulty analysis can consume resources and lead to expensive business decisions. Following recommendations for A/B test design can mitigate the costs of mistakes and produce meaningful results.
William J. Comcowich founded and served as CEO of CyberAlert LLC, the predecessor of Glean.info. He is currently serving as Interim CEO and member of the Board of Directors. Glean.info provides customized media monitoring, media measurement and analytics solutions across all types of traditional and social media.