Microsoft Virus Fighter: Spam May Be More Difficult To Stop Than HIV

Microsoft researcher David Heckerman helped stop junk mail headaches. Now, he's using some of the same techniques he helped pioneer to fight HIV.

David Heckerman has been studying HIV for the past eight years. But unlike most other scientists looking to cure one of the most deadly diseases of our time, Heckerman is a Microsoft researcher who built the e-mail spam filters used in Hotmail, Outlook, and Exchange. 

As both a computer scientist and a medical doctor, he realized that spam and HIV share certain traits: they mutate constantly, have myriad potential variations, and spread quickly. Heckerman's team at Microsoft’s Research Labs uses graphical models for data analysis and visualization to track how the virus mutates, in hopes of building a vaccine to finally stop it. He has partnered with two major research consortiums: IHAC (the International HIV Adaptation Collaborative) and IHCS (the International HIV Controllers Study), a group that tracks people who are managing to keep the virus at bay without major medications. He hopes to help usher a new kind of vaccine to clinical trials in South Africa sometime within the next two years.

Here, Heckerman shares what he's discovered working on the frontline of junk mail and virus-fighting. His explanations shed light on both scourges and, as it turns out, spam may be the more difficult one to eradicate.

Finding 1: Both Spam And HIV Mutate A Lot.

A spam filter basically uses pattern recognition to search for clues in messages that might be junk. "Early on, this was the main problem we had with our spam filter. It was not that the mail itself mutated, it was that the spammers would send mail with messages designed to cleverly get around the filter," Heckerman says.

"They started putting the '@' sign in Viagra. We would change our spam filter to recognize that and they would use bitmaps until we’d recognize that and figure out what they were doing. We had this adversarial situation where we thought, Okay what we have to do is go out to the 'weak link' in this situation. What is the one thing they can't change? Spammers can't change the fact that they want to get money out of you, that they take you to a site asking for your credit card information. We started cataloguing the sites where the links redirected and looking for the links in messages and it worked very well."

This became the basic premise for fighting spam in Hotmail and Exchange.
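
The "weak link" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Microsoft's filter: the catalogue contents and the names `KNOWN_SPAM_HOSTS`, `link_hosts`, and `looks_like_spam` are hypothetical. The point is that the check ignores the message wording entirely and looks only at where the links lead.

```python
import re
from urllib.parse import urlparse

# Hypothetical catalogue of hosts that spam links have been seen to lead to.
KNOWN_SPAM_HOSTS = {"cheap-pills.example", "win-big.example"}

# Rough pattern for http(s) links in plain message text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def link_hosts(message: str) -> set[str]:
    """Return the lowercase host name of every http(s) link in the message."""
    hosts = set()
    for url in URL_PATTERN.findall(message):
        host = urlparse(url).hostname  # already lowercased by urlparse
        if host:
            hosts.add(host.rstrip("."))
    return hosts


def looks_like_spam(message: str, catalogue: set[str] = KNOWN_SPAM_HOSTS) -> bool:
    """Flag a message when any of its links points at a catalogued host."""
    return not link_hosts(message).isdisjoint(catalogue)
```

Because only the destination is inspected, respelling a word as "V1@gra" or hiding text in a bitmap makes no difference to the result, which is the property Heckerman describes.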

HIV works similarly. "If you look at how much HIV mutates when it gets into a single person, it mutates as much as the flu has mutated in the entire known history of the flu. If you look at it across a lot of people it is extremely diverse. Our immune system attacks and the HIV mutates to avoid attack again, so things go back and forth." So he started looking for the virus’s weak link.

"Working with researchers at Harvard, we think there are certain spots in HIV where HIV really doesn’t want to mutate. If it mutates, it weakens the virus tremendously. We are trying to pinpoint those spots, then vaccinate to train our immune system to attack those weak spots so that when HIV does mutate, it cripples it. We are using the same strategy of going after the weak link of the opponent."

Finding 2: Big Data Is Required.

"If you just give me a handful of spam messages, I would not be able to figure out what the weak points are," says Heckerman. The first spam filter deployed at Microsoft was based on a very limited data set: 20 people saving their junk mail. Today, Hotmail collects hundreds of thousands of data points a day just from the small fraction of people who volunteer to label their spam, and that makes it far easier to fight.
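
To see why volume matters, here is a minimal sketch of how labelled messages might be turned into evidence about which link destinations signal junk. It uses smoothed log-odds, a standard textbook statistic chosen purely for illustration; the article does not say which model Hotmail uses, and the function name `host_log_odds` and its input format are assumptions.

```python
import math
from collections import Counter


def host_log_odds(labelled, smoothing=1.0):
    """Score each link host by how much more often it shows up in spam.

    labelled: iterable of (hosts, is_spam) pairs, where hosts is the set of
    link hosts found in one message.
    Returns {host: log-odds}; a positive score means more typical of spam.
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for hosts, is_spam in labelled:
        if is_spam:
            n_spam += 1
            spam_counts.update(set(hosts))
        else:
            n_ham += 1
            ham_counts.update(set(hosts))

    scores = {}
    for host in set(spam_counts) | set(ham_counts):
        # Additive smoothing keeps a host seen in only one class from
        # producing a division by zero or an infinite score.
        p_spam = (spam_counts[host] + smoothing) / (n_spam + 2 * smoothing)
        p_ham = (ham_counts[host] + smoothing) / (n_ham + 2 * smoothing)
        scores[host] = math.log(p_spam / p_ham)
    return scores
```

With only a handful of labelled messages the smoothing term dominates and every score stays close to zero; the scores become informative only as the counts grow.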

The same applies to HIV, he says. "You show me ten to a hundred samples of HIV and I’m not going to be able to learn anything. You show me hundreds of thousands of examples of how HIV works and I start to see its secrets."

In both cases, researchers are using machine learning to create statistical ways of dealing with large datasets in order to find the needle in the haystack.

"In the case of junk mail, it was figuring out which links were really predictive of being junk mail," he says. "For the HIV side, it's the same thing: How to figure out where the weak links are in HIV?" Of course, Microsoft doesn’t have a wet lab. To access gene sequences, Microsoft coordinates with IHAC and IHCS, which are in the process of collecting tens of thousands of blood samples from infected people around the world.

Yet clues for what immune responses might stop the virus are far more subtle than spotting spammers. The HIV genome is roughly 9,000 nucleotides long, and every position should be examined. People have different immune systems, which are capable of attacking the virus at different points. Some immune systems can attack at a single point and effectively cripple the virus, but others may have to attack at multiple points to do so. "To make a vaccine that is effective for a whole population, we need to study both single-point and multiple-point attacks," Heckerman says. "In HIV, when you have all that diversity, it is really hard for a human being to look at all the data and figure out any secrets by inspection. It really takes a computer to do that."

Or rather: computers. Heckerman uses a cluster of machines sitting on the 4th floor of Building 99 in Redmond that operates at 3,000 times the processing power of a single computer. He has also designed a patented algorithm called PhyloD to analyze each sequenced HIV genome and catalogue potential weak links among nucleotides in each sample for comparison with other data sets. "It would take more than a year for a single machine to do that, but we can do it in a few hours."
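
The two figures quoted above are consistent with each other, as a quick back-of-the-envelope check shows. The sketch assumes the work divides evenly across the cluster (ideal linear scaling), which real workloads rarely achieve exactly; the function name `cluster_hours` is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year
SPEEDUP = 3_000            # cluster vs. one machine, as quoted; assumes ideal scaling


def cluster_hours(single_machine_hours: float, speedup: float = SPEEDUP) -> float:
    """Estimate wall-clock hours on the cluster for a job of the given size."""
    return single_machine_hours / speedup


print(round(cluster_hours(HOURS_PER_YEAR), 2))  # 2.92
```

A job that would occupy one machine for a full year comes out at just under three hours, in line with Heckerman's "a few hours."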

As a result: "We can measure each person's immune system type and what the HIV does to that particular person and see how it maneuvers around in different immune systems to get a better idea of what HIV can and can’t do."

Finding 3: Spam Will Most Likely Outlast HIV.

With spam, even an approximate fix can be deployed publicly. And as spammers try new tactics, the Microsoft team just adds new tweaks and instantly tests them. "We do have clinical trials in one sense because we hypothesize how a model is going to work and test it. In the case of spam, that data is always very easy to come by." Plus, any minor errors can be corrected.

Not so with a vaccine. Any HIV fix will have to account for all strains and immune types when it's released. And there is no room for error when dealing with medicine. "The primary goal is to prevent you from getting it in the first place, or, if you get it, to suppress the virus to the point where it's not going to hurt you, you are not going to go on and get AIDS." Or, in email parlance: "It’s way easier to filter it up front than it is to clean up your messages after you have been spammed."

But once a vaccine has been created, it shouldn’t really need updates. "I hope it's easier," he says of controlling HIV versus stopping spam once and for all.

"The difference between spam and HIV is that in the case of spam we have clever yet evil humans trying to get around spam filters. In the case of HIV there is no conscious entity trying to defeat us. It’s just that Mother Nature evolved a horrible virus. Hopefully, when we identify all the weak spots, we are done."

[Photo Illustration: Joel Arbaje]
