Inside Baidu’s Plan To Beat Google By Taking Search Out Of The Text Era

In this exclusive interview, Andrew Ng explains why he left Google Brain to head Baidu’s deep learning project.


Text-based search has been the input of choice for web search engines for the past 24 years. That’s soon going to change.


Baidu, China’s biggest search engine, recently hired former Google Brain mastermind Andrew Ng to head up a massive deep learning project. Focused on building an infrastructure for solving problems like image recognition and speech processing, Baidu’s work signals a paradigm shift in the way users retrieve information online.

Ng was announced as Baidu’s new head of research back in May, working out of the company’s Silicon Valley offices. One of his first big projects with Baidu is creating a vast deep learning computer cluster with around 100 billion digitally simulated neural connections. By harnessing the power of deep learning, Ng hopes to revolutionize the way we carry out search functions.

“With the Google Brain project we made the decision to build deep learning processes on top of Google’s existing infrastructure,” he says. “What we’re doing at Baidu is seizing the opportunity to build the next generation of deep learning infrastructure. This time we’re building everything from the ground up using a 2014-era GPU infrastructure.”

Baidu has given Ng room to work on some of the biggest deep learning problems around. “From the engineers through to the executives, I think everyone at Baidu really ‘gets’ this field,” he says. “Deep learning is a very capital-intensive area, and it’s rare to find a company with both the necessary resources and a company structure where things can get done without having to pass through too many channels and committee meetings. That’s essential for an immature technology like this.”


The primary catalyst for a step-change in how search works is the rise of smartphones and tablets, which are taking more and more market share from traditional PCs. This is particularly evident in China, Baidu’s home market, where many users are connecting to the Internet for the first time–primarily via mobile devices. Of China’s 632 million Internet users as of June this year, 83% accessed the web with a mobile phone, according to figures from the China Internet Network Information Center.

Most of these users haven’t organically learned how to use text-based search as it’s evolved from Ask Jeeves to DuckDuckGo over the past several years. That presents an opportunity to re-think basic assumptions about search, and it extends beyond developing markets. “Text input is certainly useful, but images and speech are a much more natural way for humans to express their queries,” Ng says. “Infants learn to see and speak well before they learn to type. The same is true of human evolution–we’ve had spoken language for a long time, compared to written language, which is a relatively recent development.”


In many cases, text-based search is not ideal for finding information. For instance, if you’re out shopping and spot a handbag you might like, it is far easier to take a picture than to try to describe it in words. The same is often true if you see a flower or animal species that you would like to identify.

Fortunately, more and more of our devices now have high-quality cameras built in–from smartphones with front- and back-facing cameras to wearables like Google Glass or the recently announced Baidu Eye.

At the same time, deep learning tools are becoming more adept at intelligently recognizing and decoding visual information. “Previously we thought about modalities like language and images having different, separate representations,” says Edward Grefenstette, Fulford Junior Research Fellow at Somerville College and an AI researcher in the Department of Computer Science at the University of Oxford. “With deep learning there has been a movement toward what is called distributed representations. This allows us to do things like align the meaning of two different languages, or language and image, in the same representational space.”
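The idea Grefenstette describes can be illustrated with a toy sketch: once images and sentences are mapped into the same vector space, finding the caption that best matches a photo is just a nearest-neighbor lookup. The vectors and captions below are invented for illustration–real systems learn embeddings with hundreds of dimensions from large paired datasets–but the mechanics are the same.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: closer to 1.0 means the vectors point the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings in a shared image/text space.
image_vec = np.array([0.9, 0.1, 0.0, 0.2])  # e.g. a photo of a dog

captions = {
    "a dog playing fetch": np.array([0.80, 0.20, 0.10, 0.10]),
    "a red handbag":       np.array([0.10, 0.90, 0.00, 0.30]),
    "un chien qui court":  np.array([0.85, 0.15, 0.05, 0.20]),  # French caption
}

# Because both modalities (and both languages) live in one space,
# ranking captions against the image is a single similarity comparison.
best = max(captions, key=lambda c: cosine(image_vec, captions[c]))
print(best)  # the dog captions score far above the handbag one
```

Note that the French caption competes directly with the English ones–that is the cross-lingual alignment Grefenstette mentions, falling out of the shared representational space for free.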

That means when a deep learning system encounters a new image it has never seen before, it can generate text describing what is shown–based on an “understanding” of the image. (The University of Toronto has published an impressive demo of this.)

The results of this research are already starting to become visible. Earlier this year Facebook unveiled DeepFace, a facial recognition system that identifies faces almost as accurately as a human. Google has also made significant advances in deep learning, even after Andrew Ng’s departure. Executed correctly, Baidu’s work has the potential to be a key part of one of the biggest AI breakthroughs yet.

It’s not just image recognition, either. “Deep learning has pretty much taken over speech recognition,” Ng says. At Baidu, error rates for speech recognition are down by about 25% as a result of deep learning research.


At the moment, around 10% of Baidu search queries are made by voice, with a much smaller percentage carried out using images. If progress continues at its current rate, however, Ng forecasts that “in five years’ time at least 50% of all searches are going to be either through images or speech.”

“Replacing text search with voice search is clearly going to happen more and more, as speech recognition improves,” says Yoshua Bengio, a professor in the Department of Computer Science and Operations Research at the University of Montreal, home to one of the world’s largest concentrations of deep learning expertise.

Andrew Ng is under no illusions about the challenge his team faces, though. Deep learning is still a new field–and despite its massive potential it can be the victim of unnecessary and unhelpful hype.

“I believe that we have not yet exploited the power of deep representation learning–and especially of the unsupervised type–and that the impact in applications could be very important a few years down the road,” says Bengio. “Basic research is needed for this to happen, though. Some of [this] might happen in industrial labs, as leading researchers there–including Andrew Ng, Geoff Hinton, and Yann LeCun–basically agree that this is an important opportunity for major future progress.”