Lessons In Website Testing From A Master

Ronny Kohavi helped Bing and Amazon become titans. Here’s how.

[Image: IM_photo via Shutterstock]

When it comes to the personalized Internet, few people know the inner mechanics of how websites conduct experiments better than Ronny Kohavi. His formal title is general manager of the analysis and experimentation team at Microsoft’s Applications and Services group, but odds are you know his work on Bing and the extremely popular Bing Images. Before he joined Microsoft, he was Amazon’s director of data mining and personalization. He also has a set of guidelines he says all website experimenters should follow.


Kohavi and colleagues Alex Deng (Microsoft), Roger Longbotham (an ex-Microsoft employee who now teaches at China’s Southwest Jiaotong University), and Ya Xu (LinkedIn, but a former Microsoft employee) presented a paper at the KDD conference, a huge data science confab, in New York on August 27. In the paper, Kohavi’s team shows off their seven rules for website experimenters, which they collected while working at Bing, Amazon, LinkedIn, and other services.

Website experimentation, meaning testing small changes on a site, helps organizations, he says. Despite recent bad press over website testing at big names like Facebook and OKCupid, Kohavi feels user testing is the most efficient way for websites to confirm that the changes they’re making are the right ones. As he explained over the phone to Co.Labs, he believes it’s more useful to test website changes on a smaller user sample before rolling them out. Otherwise, if the changes are disliked, you risk giving all of your users a negative experience rather than a small percentage of them.
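To make the idea of testing on a smaller user sample concrete, here is a minimal sketch of one common way to split traffic: hash each user ID into a stable bucket and send a fixed fraction to the new version. This is an illustration only, not a description of how Bing or Amazon assign users. The function name `assign_variant`, the experiment label, and the 10% default are all assumptions made for the example.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_fraction: float = 0.10) -> str:
    """Deterministically place a user in 'treatment' or 'control' (hypothetical helper)."""
    # Hash the experiment name together with the user ID so the same user
    # always gets the same answer for a given experiment, but can land in
    # different groups across different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in the range [0, 1).
    position = int(digest[:8], 16) / 0x100000000
    return "treatment" if position < treatment_fraction else "control"


# Usage: check what share of 100,000 made-up users sees the change.
users = [f"user-{i}" for i in range(100_000)]
in_treatment = sum(assign_variant(u, "new-checkout") == "treatment" for u in users)
print(f"share in treatment: {in_treatment / len(users):.3f}")
```

Because the assignment depends only on the user ID and the experiment name, a returning visitor keeps seeing the same version, and only the chosen fraction is exposed if the change turns out to be a bad one.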

Here are the seven rules of thumb for web testers that Kohavi and his team uncovered:

1. Small changes can have an impact on key metrics.

“In software, when you ship software, you have bugs,” Kohavi told Co.Labs. “It is very common for us to ship something we think is small in an experiment and then see a large negative impact because there is a bug. We once made a change in JavaScript code that was tested on the desktop and went live as a 10% experiment. It then turned out that Internet Explorer 7 users couldn’t even click on the link, which was a terrible experience.”

In the paper, the authors give the example of Amazon tweaking the way they displayed credit card offers in 2004. By moving Amazon credit card offers from the homepage to checkout pages, and adding text that showed users how much they could save by signing up for an Amazon card, the retailer made “tens of millions of dollars in profit annually” through the simple change.


2. Changes rarely have a big positive impact on key metrics.

The Microsoft team says that it’s extremely rare for any one test to have a large impact on key metrics for a website. Rather, small changes implemented by each individual test add up. “The day-to-day improvements we make to Bing are made inch by inch,” Kohavi added.

3. Your mileage will vary.

While it’s important to keep informed on the tests your competitors are running, the Microsoft team says that what works for one website won’t necessarily work for another. “Results aren’t always reliable because people are duped by chance.”
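One standard guard against being duped by chance is a significance test on the difference between two versions. The sketch below uses a textbook two-proportion z-test; it is not taken from the paper, and the conversion counts are invented purely for illustration.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates B minus A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally well.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: 5.0% vs. 5.6% conversion on 10,000 users each.
z, p = two_proportion_z_test(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these invented counts, a lift that looks healthy on a dashboard still falls just short of the conventional 0.05 threshold, which is the kind of result that is easy to over-read.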

4. Speed matters.

More than anything else, users respond to speed and fast response times on the websites they visit.

“We ran slowdown experiments (on Bing) where we slowed user experience a little to see what it would do to our metrics–very slightly. If we slowed users down by 100 or 200 milliseconds, it was really amazing how much performance matters. This isn’t just Bing, we saw this at Amazon where numbers are big, and Google did similar experiments as well,” Kohavi says. “They’re all based on the simple idea that it’s easy to look at small degradations in performance and see what happens to user statistics–the changes are surprisingly large.”

In Bing’s case, it turns out that every 100-millisecond speedup boosted revenue by 0.6%.
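For a rough sense of scale, the 0.6% figure can be turned into back-of-the-envelope arithmetic. The sketch below assumes the effect scales linearly with the size of the speedup, which the article does not claim, and the revenue figure is a made-up placeholder rather than a real Bing number.

```python
# 0.6% revenue lift per 100 ms of speedup, the Bing figure quoted above.
REVENUE_LIFT_PER_100_MS = 0.006


def estimated_revenue_change(annual_revenue: float, speedup_ms: float) -> float:
    """Estimate extra annual revenue, assuming the lift scales linearly (an assumption)."""
    return annual_revenue * REVENUE_LIFT_PER_100_MS * (speedup_ms / 100)


# Hypothetical site with $1 billion in annual revenue.
hypothetical_revenue = 1_000_000_000
for ms in (50, 100, 200):
    extra = estimated_revenue_change(hypothetical_revenue, ms)
    print(f"{ms} ms faster: about ${extra:,.0f} more per year")
```

The same function returns a negative number for a negative `speedup_ms`, which mirrors the slowdown experiments Kohavi describes.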


5. Reducing abandonment is hard, but shifting clicks is easy.

This is just a complicated way of saying that it’s easy to get a user to click on different spots on a page–but harder to get them to click more on a page.

Kohavi gave the example of a test where Bing removed related searches from the search engine’s right-hand column for over 10 million users. In that experiment, pages still got approximately the same number of clicks; users simply spread their clicks, subconsciously, around the rest of the page.

6. Avoid complex designs.

Iterate! Kohavi and his co-authors strongly believe that simpler is better for website testing. He says the best tests are A/B, A/B/C, or A/B/C/D, and that more complicated testing doesn’t necessarily benefit the organization.

One example cited in their work was testing on LinkedIn’s unified search platform. When the social network changed its search engine in 2013, complicated tests with multiple factors made it hard for the site to find elements in the new search engine that users disliked. It was only when tests were radically simplified that LinkedIn was able to find which aspects of the new search functionality turned users off.

7. Have enough users.

This is the hardest one. Without enough users visiting the site and serving as unwitting subjects, Kohavi’s team believes it’s impossible to get accurate results. The magic number, however, depends on a number of different factors.
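Two of those factors are how noisy the metric is and how small a change you want to detect. A widely used statistical rule of thumb (not necessarily the one in Kohavi’s paper) puts the number of users needed per variant at about 16 times the metric’s variance divided by the square of the smallest difference worth detecting, which corresponds to roughly 80% power at a 5% significance level. The sketch below applies it to a conversion rate; the function name and the sample rates are assumptions chosen for illustration.

```python
import math


def users_per_variant(baseline_rate: float, relative_lift: float) -> int:
    """Approximate users needed in each group, using n = 16 * variance / delta**2."""
    # Variance of a yes/no outcome such as "did the user convert?"
    variance = baseline_rate * (1 - baseline_rate)
    # Smallest absolute change in the rate that we want to be able to detect.
    delta = baseline_rate * relative_lift
    return math.ceil(16 * variance / delta ** 2)


# Hypothetical site with a 5% conversion rate.
for lift in (0.20, 0.10, 0.05):
    n = users_per_variant(0.05, lift)
    print(f"to detect a {lift:.0%} relative lift: about {n:,} users per variant")
```

Halving the lift you want to detect roughly quadruples the users required, which is why small sites struggle to measure the inch-by-inch improvements described under rule 2.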


Clarification: An earlier version of this article specifically mentioned Kohavi’s work on Bing Images. Kohavi’s work includes experiments throughout Bing.