Back in 1999, IBM unveiled plans to build its next-generation supercomputer. Its code name: Blue Gene, because its main job would be to simulate the intricate folding of human proteins.
Blue Gene's power is expected to be so immense that it's hard to grasp. When Blue Gene was first announced, IBM officials told reporters that the new computer would, in theory, have enough processing capability to download every page on the Web in less than a second. Blue Gene isn't expected to begin operations until sometime after 2004, following its scheduled completion. But when it is finished, it will be far faster than the world's 500 fastest computers of today combined.
Blue Gene will have up to 1 million processors, working together to solve complex modeling problems. The sheer scale of that design is what first caused IBM researchers to rethink how the computer should be built. "We will lose processors every day just from cosmic rays that enter our atmosphere and bombard the chips," says Bill Pulleyblank, the IBM research executive in charge of Blue Gene. The vast array of processors, combined with the law of large numbers, means that, on average, three processors a day would get zapped by radiation or fail for other reasons.
"So if I lose three a day, I may have lost 1,000 processors after one year," says Pulleyblank. Out of 1 million processors, that might not seem like many. But computers aren't built to cope with failed processors. When a processor fails, computing simply stops. No work assigned to that failed processor gets done -- the computer just waits for a person to come fix or replace the processor. Other processors that depend on that work may stall as well.
But when a machine costs $100 million to develop and, despite its awesome size, must run around the clock to finish the kinds of problems it is working on, you can't let the loss of one-tenth of 1% of its processors bring everything to a halt.
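The arithmetic behind those figures is easy to check. The short Python sketch below uses only the numbers quoted above (1 million processors, three failures a day on average); the function names are invented for illustration, and the steady failure rate is a simplifying assumption, not a model of how the real machine would behave.

```python
# Figures quoted in the article; the constant failure rate is an assumption.
PROCESSORS = 1_000_000    # upper bound on Blue Gene's processor count
FAILURES_PER_DAY = 3      # average losses to radiation and other causes


def failed_after(days, per_day=FAILURES_PER_DAY):
    """Expected number of processors lost after the given number of days."""
    return days * per_day


def fraction_failed(days, total=PROCESSORS, per_day=FAILURES_PER_DAY):
    """Expected share of the machine lost after the given number of days."""
    return failed_after(days, per_day) / total


if __name__ == "__main__":
    print(failed_after(365))                 # roughly the 1,000 Pulleyblank cites
    print(f"{fraction_failed(365):.4%}")     # about one-tenth of 1%
```

A year of losses at that rate comes to 1,095 processors, which is where the round figures of 1,000 processors and one-tenth of 1% come from.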
This is the problem that inspired SMASH. The Blue Gene team had two choices. The traditional approach would be to have technicians scramble from one end of the machine to the other every day, finding failing processors and swapping them out. The alternative: SMASH -- the autonomic approach. "I have to operate with the assumption that any component may fail, unpredictably, ungracefully, at any time," says Pulleyblank, "and I have to keep working. That is a fundamentally different approach to computer design: Assume you'll have problems, assume you'll have errors, and build in the ability to deal with them and keep working."
Blue Gene will have the circuitry -- the hardware -- necessary to monitor itself. It will have a primitive form of self-awareness -- the software -- to understand how it is performing and to identify failures. And it will have the problem-solving ability and the physical components to reroute work and internal communications when things aren't working right, or as processors fail.
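To make the idea of rerouting work concrete, here is a minimal Python sketch of a pool that hands the tasks of a failed processor to the healthy ones. It is purely illustrative: the `Pool` class, its method names, and the least-loaded placement rule are all assumptions made for this example, and nothing here describes IBM's actual design.

```python
from collections import defaultdict


class Pool:
    """Hypothetical pool of processors that keeps working when one fails."""

    def __init__(self, worker_ids):
        self.healthy = set(worker_ids)
        self.assigned = defaultdict(list)   # worker id -> list of task names

    def _least_loaded(self):
        if not self.healthy:
            raise RuntimeError("no healthy workers left")
        # Sorting first makes tie-breaking deterministic.
        return min(sorted(self.healthy), key=lambda w: len(self.assigned[w]))

    def submit(self, task):
        """Give the task to the healthy worker with the fewest tasks."""
        worker = self._least_loaded()
        self.assigned[worker].append(task)
        return worker

    def mark_failed(self, worker):
        """Retire a worker and reroute its tasks; returns task -> new worker."""
        self.healthy.discard(worker)
        orphaned = self.assigned.pop(worker, [])
        return {task: self.submit(task) for task in orphaned}


pool = Pool(["p0", "p1", "p2"])
for name in ["t0", "t1", "t2", "t3", "t4", "t5"]:
    pool.submit(name)

rerouted = pool.mark_failed("p1")   # p1's tasks move to p0 and p2
print(rerouted)
```

The point of the sketch is the shape of the behavior, not the policy: a failure is treated as a routine event that triggers reassignment, rather than as a reason to stop and wait for a technician.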
The ultimate goal of autonomic functioning is to be able to tell computers what you want them to do and have them work out the details. In other words, to create a world where strategy and implementation are inseparable.
As arcane as that kind of semi-intelligent automation may seem, we take it for granted in our lives. An air-conditioning thermostat is autonomic in the sense that a person sets the desired temperature, and the electronics of the air conditioner maintain that temperature. The automatic transmission on a car interprets the instructions of the driver's foot autonomically, compared with the operation of a manual transmission. On a much more sophisticated level, the telephone system -- a vast array of interconnected equipment, networks, switches, and service providers -- functions autonomically. It is both self-healing and virtually faultless. How often do you reboot your phone?
How might autonomic computing work in practice? Take the computers of a financial-services company. The company is constantly receiving transaction requests from its own brokers and directly from customers. The company's computers also need to provide routine information to employees and customers: account balances, transaction updates, research information. And the computers need to tend to all kinds of back-office chores: keeping the company's financial records, doing payroll, providing research information and email services.
Donna Dillenberger, a senior technical researcher who is developing some of IBM's new autonomic software with financial-services companies, teases out what happens when you add autonomic ability to such a company's computers. "Say you have three users trying to access the computers at the same time," says Dillenberger. "One is a premium user. You want that customer to get a quick response time -- under 3 seconds -- every time. The second user isn't a premium user, but you still want no more than a 10-second response time. Then there's the user who has no limits on the response time. That person can wait."
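One simple way to picture Dillenberger's three users is a queue that serves whichever request is closest to missing its response-time goal. The Python sketch below assumes that approach (an earliest-deadline-first queue); the tier names, the `RequestQueue` class, and its methods are hypothetical, and only the 3-second and 10-second goals come from her example. It is not a description of IBM's software.

```python
import heapq
import itertools
import math

# Response-time goals in seconds; tier names are invented for this sketch.
RESPONSE_GOAL_SECONDS = {
    "premium": 3.0,           # quick response every time
    "standard": 10.0,         # no more than a 10-second response
    "best_effort": math.inf,  # no limit; this user can wait
}


class RequestQueue:
    """Hypothetical queue that serves the request with the nearest deadline."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker keeps arrival order

    def add(self, user, tier, arrival_time):
        deadline = arrival_time + RESPONSE_GOAL_SECONDS[tier]
        heapq.heappush(self._heap, (deadline, next(self._counter), user))

    def next_user(self):
        _deadline, _order, user = heapq.heappop(self._heap)
        return user


queue = RequestQueue()
# All three users arrive at the same moment, in the "wrong" order.
queue.add("carol", "best_effort", 0.0)
queue.add("bob", "standard", 0.0)
queue.add("alice", "premium", 0.0)

served = [queue.next_user() for _ in range(3)]
print(served)
```

The person running the system states only the goals. Deciding who goes first at any given moment is left to the machine, which is the division of labor the autonomic approach is after.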