Why Replication Is Critical for Every Web Marketing Test

Tests that can’t be replicated aren’t scientifically valid. Here are some ways you can improve the validity of online marketing experiments.
“We’ve learned that the best data scientists are skeptics and follow Twyman’s law: Any figure that looks interesting or different is usually wrong. Surprising results should be replicated—both to make sure they’re valid and to quell people’s doubts.” Harvard Business Review, Oct 2017
How many times do you replicate a statistically significant web marketing test before reporting it out, making recommendations, or otherwise acting on it as if it were true?
If the answer is “rarely or never”, your web experimentation process is likely missing one of the critical components of scientific reasoning and knowledge discovery: replication.
For the past decade or so, online controlled experiments (such as A/B testing) have become common on practically all major commercial websites.
Most organizations operate under the assumption that, with the right tools, they can easily and predictably improve their KPIs: a 5 percent incremental improvement here, an 8 percent improvement there, and so on toward a scientifically proven optimal (or at least improved) state, all while eliminating or substantially reducing the risk involved in making a change.
We’ve all seen the reports that A/B testing works for sites like Amazon, Facebook, Bing, Google, and the Obama campaign. But for the preponderance of commercial websites that don’t get anywhere close to this level of traffic, and don’t have an 80-person Analysis and Experimentation team like Microsoft, is scientifically valid progress truly being made in the realm of web experiments towards optimal states of user and system behavior?
You may have heard of the reproducibility crisis in modern science, but I’m here to argue that the crisis extends to controlled experiments run on websites.
Attempting to reproduce the results of an experiment or test is a defining feature of the scientific method.
Claims of truth based on a single experiment are misguided. No claim (scientific or otherwise) should be considered an accurate representation of reality unless it can be replicated.
This is one of the reasons scientific and other journals exist – to allow critical peer review and enable third parties to verify methods and results through replication.
Web professionals often already rely on reproducibility in their daily modes of operation without even thinking about it, whether or not it’s a formal part of their experimentation process.
Consider the case of a website editor who notices what at first take appears to be an error in the loading and presentation of a webpage.
The foolish editor will immediately fire off a hasty email alerting colleagues to the problem.
The wise editor, though, will refresh the page, try a different page on the same website, try the same page in a different browser, try the same page on a cell phone, and try the same page on a different web connection: all examples of replication performed before concluding that there is, in fact, a real issue worth escalating to the QA or dev team.
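That instinct can even be expressed in code. The sketch below is only an illustration, not a real QA tool: the URL is a placeholder, and a genuine check would cover far more conditions than a couple of user agents.

```python
import requests  # third-party HTTP library: pip install requests

# Placeholder URL standing in for the page that looks broken.
URL = "https://example.com/page-that-looks-broken"

# Re-request the same page under a few different conditions before raising the
# alarm, much as the wise editor refreshes, switches browsers, and switches networks.
conditions = {
    "desktop browser": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    "mobile browser": {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X)"},
    "no user agent": {},
}

for label, headers in conditions.items():
    response = requests.get(URL, headers=headers, timeout=10)
    print(f"{label:>15}: status={response.status_code}, bytes={len(response.content)}")

# Only if the problem reproduces across conditions is it worth escalating to QA or dev.
```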
As Neil deGrasse Tyson told Joe Rogan, without replication there is no scientific objective truth.
Many people are shocked to learn that it’s the nature of science that most research in scientific journals will ultimately be shown to be wrong.
The same is true for experiments run on the web by non-scientists who have not had their results pass the scrutiny of a journal submission.
As Tyson puts it, replication of experiments is “the bleeding edge of science” because it deals with complete unknowns.
There are no facts in the back of a textbook for verification that you conducted the experiment correctly – to say nothing of your interpretation of the results.
Anyone who has taken high school chemistry should recall the wild variation in results that you and your lab partner got when attempting to replicate even the most basic experiments.
The modern scientific establishment has been described as being in a reproducibility crisis: experiments are not being reproduced before being reported or taken as fact by the public, and when experiments are re-run, the findings very often are not confirmed.
That findings are invalidated through failed replication is a natural part of the process of science and not a reason for alarm. But if replication studies are never conducted, if results are reported prior to replication, or, worse yet, if failed replications are thrown out rather than accounted for, then it becomes an enormous problem.
People are making decisions and acting based on faulty claims of truth.
As Tyson points out in that interview, news reporters take new findings and report them as fact when it makes for a good story.
Similarly, many people conducting controlled experiments on the web wrongly treat preliminary findings as fact. It is an easy error to commit without realizing it, because it feels like progress is being made.
The reality is that a single test is not a scientifically valid “fact” and shouldn’t be acted on as if it were. Tests must be repeated, with identical setups and with slight variations, before the accumulated body of experiments can be treated as evidence of the truth.
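When several runs of the same test do exist, their evidence can be pooled rather than eyeballed one run at a time. Here is a minimal sketch using Fisher’s method via SciPy; the p-values are invented for illustration, and Fisher’s method is only one of several reasonable ways to combine replications.

```python
from scipy.stats import combine_pvalues

# p-values from independent runs of the same experiment (invented for illustration).
replication_pvalues = [0.04, 0.60, 0.35]

# Fisher's method pools the runs into a single test of the shared null hypothesis.
statistic, combined_p = combine_pvalues(replication_pvalues, method="fisher")
print(f"combined p-value across {len(replication_pvalues)} runs: {combined_p:.3f}")

# The first run alone (p = 0.04) looks like a winner; judged as a body of
# evidence, the combined p-value lands well above the usual 0.05 threshold.
```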
It isn’t only scientific experiments that require replication to tease out insights and discover knowledge. Consider the “State of” report style that is increasingly common and is often the result of an annual survey of professionals in an industry.
The recently released DORA (DevOps Research & Assessment) State of DevOps 2018 report, for example, shows changes and trends over time due to replication of the survey.
On page 6 of the 2018 DORA report, replication is noted as critical to “confirm and revalidate prior years’ results”. A one-off survey is unable to show changes or verify previous findings. The data exists in a vacuum and as such isn’t as reliable as similar data from a multi-year survey.
This sort of long-term experiment is known as a longitudinal study, and although it isn’t identical to replication, the two approaches solve for many of the same problems.
Today, replication is rarely included in web experimentation processes. In a Harvard Business Review interview with Kaiser Fung, founder of the applied analytics program at Columbia University, it’s noted that:
“We tend to test it once and then we believe it. But even with a statistically significant result, there’s a quite large probability of false positive error.”
HBR continues:
“False positives can occur for several reasons. For example, even though there may be little chance that any given A/B result is driven by random chance, if you do lots of A/B tests, the chances that at least one of your results is wrong grows rapidly.”
It is vital to understand that statistical significance, on its own, does not establish that an experiment validly tested its hypothesis, and reaching statistical significance certainly doesn’t mean replication is unnecessary.
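To see how quickly that multiple-testing risk grows, here is a minimal sketch in plain Python. The 5 percent threshold and the test counts are illustrative assumptions, not figures from the HBR piece:

```python
# Probability of at least one false positive across k independent A/B tests,
# assuming every tested change truly has no effect and each test uses the same
# significance threshold. The threshold and test counts are illustrative only.

alpha = 0.05  # per-test false positive rate

for k in (1, 5, 10, 20, 50):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:>2} tests -> P(at least one false positive) = {p_any_false_positive:.2f}")

# The probability climbs from 0.05 for a single test to roughly 0.92 at 50 tests,
# which is why a lone "significant" result deserves replication.
```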
Microsoft is one organization that publicly acknowledges its use of replication for web experiments. As they explain in their comprehensive report A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments (section 5.5):
“We ran an experiment in Bing.com where we observed a statistically significant positive increase for one of the key Bing.com OEC metrics…Very few experiments succeed in improving this metric…. whenever key metrics move in a positive direction we always run a certification flight which tries to replicate the results of the experiment by performing an independent run of the same experiment. In the above case, we reran the experiment with double the amount of traffic and observed that there were no statistically significant changes for the same metric.”
In this example from Bing, a finding that was initially thought to be positive and valid, and was statistically significant, was in fact shown to be inaccurate upon replication.
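Microsoft hasn’t published the code behind its certification flights, but the underlying check is conceptually simple: rerun the same comparison on fresh, independent traffic and see whether the metric still moves. A minimal sketch using a standard two-proportion z-test, with invented conversion counts, might look like this:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_b - p_a, 2 * norm.sf(abs(z))  # observed lift, two-sided p-value

# Hypothetical original run: the variant looks like a clear, significant winner.
lift, p = two_proportion_z(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
print(f"original run:    lift={lift:+.3%}, p={p:.4f}")

# Hypothetical replication on fresh traffic (double the volume): no confirmation.
lift, p = two_proportion_z(conv_a=1_010, n_a=20_000, conv_b=1_045, n_b=20_000)
print(f"replication run: lift={lift:+.3%}, p={p:.4f}")
```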
Try telling your CMO that a recent A/B test showing a 20 percent lift in the desired behavior is most likely an anomaly, despite the initial experiment reaching statistical significance, and that further testing is required before the truth value of the original findings can be determined. This is simply the nature of science.
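One way to make that conversation concrete is to think about base rates. If only a small fraction of tested ideas genuinely work (and the Bing example above suggests big wins are rare), then even a clean, statistically significant result is often a false alarm. Here is a back-of-the-envelope sketch, in which the 10 percent base rate, 80 percent power, and 5 percent alpha are all assumptions rather than measured values:

```python
# Rough estimate of P(the effect is real | the test came back significant),
# i.e. the positive predictive value of a single significant A/B test.
# All three inputs are assumptions chosen for illustration.

base_rate = 0.10  # fraction of tested ideas that truly move the metric
power = 0.80      # chance a real effect reaches significance
alpha = 0.05      # chance a null effect reaches significance anyway

true_positives = base_rate * power
false_positives = (1 - base_rate) * alpha
ppv = true_positives / (true_positives + false_positives)

print(f"P(real effect | significant result) ~ {ppv:.0%}")
# About 64% under these assumptions: roughly one "win" in three is noise,
# and the estimate drops further if winning ideas are rarer than 10 percent.
```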
Science often moves slowly, and it always requires meticulous documentation and replication of experiments before a claim can be considered objectively true.
Reproducibility and falsifiability (the capacity for a hypothesis to be proven wrong) are fundamental to the scientific method that has resulted in our awesome modern world.
For those readers who need a refresher on the scientific method, I recommend this handy guide published by the UC Museum of Paleontology at the University of California, Berkeley, in collaboration with the National Science Foundation and a diverse group of teachers and scientists.
Regarding the role of replication, the guide advises:
“If a finding can’t be replicated, it suggests that our current understanding of the study system or our methods of testing are insufficient…. In some fields, it is standard procedure for a scientist to replicate his or her own results before publication in order to ensure that the findings were not due to some fluke or factors outside the experimental design. The desire for replicability is part of the reason that scientific papers almost always include a methods section, which describes exactly how the researchers performed the study. That information allows other scientists to replicate the study and to evaluate its quality, helping ensure that occasional cases of fraud or sloppy scientific work are weeded out and corrected.”
Similar standards of replication must be a requirement for controlled experiments on the web. Marketers, like scientists, should replicate their own experiments before broadcasting the results or acting on them as if they are true.
We’ve seen that replication is a critical step for valid scientific findings. Once replication is complete, the next step in validating an experiment’s findings is peer review by unbiased experts.
Science journals exist as a platform for peer review, allowing experts to provide feedback on experiment design, analysis, and findings.
Many scientific findings are rejected from journals because they do not meet the standards of the review committees responsible for maintaining quality and rigor. Only the small subset of scientific findings that withstand the scrutiny of unbiased experts is considered valid enough to be included in a journal.
The great deal of time spent on research that never makes it into a journal isn’t wasted; it is a necessary condition for scientific progress.
Even with the added safety net of peer review scrutiny, poorly designed experiments still manage to get published.
As Understanding Science notes:
“Many fields outside of science use peer review to ensure quality. Philosophy journals, for example, make publication decisions based on the reviews of other philosophers, and the same is true of scholarly journals on topics as diverse as law, art, and ethics.”
The only platform known to this author that can solve for peer review today is GoodUI, which can provide ideas and added rigor for marketing organizations conducting experiments.
While this may initially seem untenable as organizations tend to be greedy with their data, the net effect of contributing to a cumulative body of trusted web experiment knowledge should prove to be worth the risk and rigor for those web experimenters who pursue this path.
If scientists worldwide began working only in small, isolated pockets, without peer review or published experiment methodologies and without a cumulative body of knowledge, scientific progress as we know it would cease.
While replication can help to solve issues with the methods and execution of an experiment, it will not solve for misinterpretation of results, which is shockingly common. This is where critical peer review becomes necessary.
There is no way to be certain how many organizations reproduce experiments before acting on them, but there is evidence of why most would not, and anecdotally I’m sure most readers will agree with the Harvard Business Review piece cited earlier that replication isn’t a commonly accepted step in the web testing process.
Lack of reproducibility for web experiments on commercial websites, to the extent that it exists, is likely due to a combination of the following forces:
There are ways to improve web experimentation processes, and they don’t require a new vendor or another JavaScript snippet.
Some organizations may need to adjust their culture of testing and reset stakeholder expectations about which use cases are appropriate for scientific experiments, the rigor valid experiments require, and the time needed to reach initial significance and then replicate.
Scientific certitude doesn’t come quickly or easily just because a team has access to an A/B testing tool and the desire for improvement.
Here are a few suggestions for improving the validity of your web experiments:
If your experiment idea isn’t worth the time it takes to follow these steps carefully, it probably isn’t worth testing to begin with.
Bring mindfulness to your decisions about which experiments to run and don’t get caught up in the hype tool vendors generate.
Use your best judgment for changes in cases that really aren’t appropriate for an experiment, and monitor KPIs after these sorts of changes are made to detect impact.
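For those untested changes, even a crude before-and-after comparison beats flying blind. The sketch below uses invented daily conversion rates and a deliberately simple rule; a real monitor would also account for seasonality and traffic mix:

```python
from statistics import mean, stdev

# Daily conversion rates before and after an untested change (invented numbers).
before = [0.051, 0.049, 0.052, 0.050, 0.048, 0.051, 0.050]
after = [0.046, 0.047, 0.045, 0.048, 0.046, 0.047, 0.045]

baseline, spread = mean(before), stdev(before)
post_change = mean(after)

# Flag the change if the post-change average drifts more than two standard
# deviations from the pre-change baseline. Crude, but it catches large regressions.
if abs(post_change - baseline) > 2 * spread:
    print(f"KPI shift detected: {baseline:.3%} -> {post_change:.3%}")
else:
    print("No meaningful KPI shift detected.")
```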
Meanwhile, keep looking for something to test that is worth the effort necessary to achieve legitimate results.
I will conclude this lengthy post with a quote from Stephen Jay Gould, from Full House: The Spread of Excellence from Plato to Darwin:
“[O]ur strong desire to identify trends often leads us to detect a directionality that doesn’t exist, or to infer causes that cannot be sustained.”
Be wary of this inclination as it relates to experiments on the web.
Special thanks to the invaluable research published at https://exp-platform.com/.