The CEO paces a conference room, brandishing a thick report. He gazes impatiently at his senior managers. “You’ve all read this,” he says. “Top-shelf consultants. Two million bucks. Pure strategic thinking. This could put us years ahead.
“The board is psyched. I’m psyched. It’s a brilliant plan. One question: Given our current technology, is this implementable?”
The response, from five different chairs in the room: “No.” The CEO looks frustrated; he doesn’t look surprised.
Why would he be? The moment neatly captures the big problem of corporate strategy: the gap between the brilliant plan and the actual execution.
Having difficulty getting great strategies implemented is so commonplace that the above moment with the CEO came not from a recent corporate meeting but from popular culture: It’s a television ad for IBM. It’s IBM, global behemoth and regular information-technology consultant, mocking the corporate tendency to turn great ideas into three-ring binders that end up as doorstops.
As the commercial fades, one of the CEO’s lieutenants asks, “Still psyched?” Corporate strategy played for laughs. IBM meets Dilbert.
What the IBM ad doesn’t say is that the company has been working on the biggest problem in strategy not by rethinking strategy, but by working on the biggest problem in computing. For years, information technology has been mired in the details instead of focused on the goals (anyone who’s ever spent an hour trying to change email settings knows that). The one thing that computing is not is computerized. IBM is starting to think about goal-oriented computing, where you tell the computers what you want to do and let them work out the details. Strategy and implementation are literally merged: Come up with the strategy, and the implementation is automated.
IBM is breaking down strategy and implementation into smaller pieces, letting each component know the goals, monitor its own performance, and do some problem solving. At IBM, this merger of strategy and implementation became known as “SMASH”: simple, many, self-healing. The most effective computers would be made of many small, interchangeable components with the ability to monitor their own performance and solve problems as they arise rather than wait for instructions from the central processor: headquarters. Biology was IBM’s inspiration. A hangnail doesn’t prevent you from typing; the flu doesn’t prevent you from walking. Similarly, a small software or hardware problem shouldn’t bring computing to a halt.
SMASH could work equally well as an approach to corporate strategy. Imagine a company that develops its strategy and embraces SMASH. No all-controlling central brain. No separation of thinkers and doers. A biological approach to finding and solving problems, a way to make the company “self-healing.”
Now, a couple of years later, the problem that inspired SMASH at IBM is still being solved. Meanwhile, the ideas that SMASH inspired are driving a dramatic new approach throughout IBM’s R&D labs, where IBMers talk about “autonomic computing,” a new branch of information technology. In the human body, the autonomic nervous system is the one that operates behind the scenes, without conscious thought. It regulates everything from the amount of light that enters your eyes to the immune response to disease.
IBM is so impressed with the potential of autonomic computing — and so humbled by the size of the problem that it has begun to unravel — that the company is trying to spark a computing revolution across the industry. It has even taken the unusual step of inviting sometime competitors such as Hewlett-Packard and Microsoft to help tackle the problems of automation. Paul Horn, head of IBM’s global-research labs, published 75,000 copies of a small, square-bound, autonomic-computing manifesto and distributed it last fall to his colleagues. And last spring, IBM hosted conferences at its research centers in California and New York to begin collaboration on autonomic computing.
In his call to arms, Horn invoked Star Trek as the model for what computers can and should be. “As early as the 1960s,” he wrote, “Captain Kirk and his crew were getting information from their computing systems by asking simple questions, and listening to straightforward answers — all without a single I/T administrator to be seen on the ship. Subtly, this has become the expectation of the masses.” Businesses, Horn wrote, want to “concentrate on setting the business policy for, say, [computer] security, then have the system figure out the implementation details. All that should matter to the business owner is: What does my business need to accomplish?”
Getting to SMASH
Back in 1999, IBM unveiled plans to release its next generation of advanced supercomputer. Its code name: Blue Gene, because its main job would be to simulate the intricate folding of human proteins.
Blue Gene’s power is expected to be so immense that it’s hard to grasp. When Blue Gene was first announced, IBM officials told reporters that the new computer would theoretically have the processing capability to download every page on the Web in less than a second. Blue Gene isn’t expected to begin operations until sometime after 2004, past its scheduled completion date. But when it is finished, it will be far faster than today’s 500 fastest computers combined.
Blue Gene will have up to 1 million processors, working to solve complex modeling problems. The enormity of that scale is what first caused IBM researchers to rethink the computer’s design. “We will lose processors every day just from cosmic rays that enter our atmosphere and bombard the chips,” says Bill Pulleyblank, the IBM research executive in charge of Blue Gene. The vast array of processors, combined with the odds of large numbers, means that, on average, three processors a day would get zapped by radiation or fail for other reasons.
“So if I lose three a day, I may have lost 1,000 processors after one year,” says Pulleyblank. Out of 1 million processors, that might not seem like many. But computers aren’t built to cope with failed processors. When a processor fails, computing simply stops. No work assigned to that failed processor gets done — the computer just waits for a person to come fix or replace the processor. Other processors waiting for that work to be finished may also settle into waiting.
But when you’ve got a machine that cost $100 million to develop, a machine that, despite its awesome size, needs to run all the time to finish the kinds of problems that it’s working on, you can’t let losing one-tenth of 1% of your processors bring everything to a halt.
This is the problem that inspired SMASH. The Blue Gene team had two choices. The traditional approach would be to have technicians scramble from one end of the machine to the other every day, finding failing processors and swapping them out. The alternative choice: SMASH — the autonomic approach. “I have to operate with the assumption that any component may fail, unpredictably, ungracefully, at any time,” says Pulleyblank, “and I have to keep working. That is a fundamentally different approach to computer design: Assume you’ll have problems, assume you’ll have errors, and build in the ability to deal with them and keep working.”
Blue Gene will have the circuitry — the hardware — necessary to monitor itself. It will have a primitive form of self-awareness — the software — to understand how it is performing and to identify failures. And it will have the problem-solving ability and the physical components to reroute work and internal communications when things aren’t working right, or as processors fail.
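To make the idea concrete, here is a minimal sketch in Python of the "assume anything can fail and keep working" approach Pulleyblank describes. It is an illustration only, not IBM's design: the `ProcessorPool` class, its method names, and the least-loaded assignment rule are all hypothetical. When monitoring reports a dead processor, the pool simply takes it out of service and hands its work to the survivors.

```python
from collections import deque


class ProcessorPool:
    """Toy pool of processors that reroutes work when one of them fails.

    Hypothetical illustration; the names and the scheduling rule are assumptions.
    """

    def __init__(self, processor_ids):
        self.healthy = set(processor_ids)
        self.assigned = {pid: [] for pid in processor_ids}  # processor -> tasks
        self.pending = deque()

    def submit(self, task):
        self.pending.append(task)
        self._dispatch()

    def _dispatch(self):
        # Hand each pending task to the least-loaded healthy processor
        # (ties broken by the lowest processor id).
        while self.pending and self.healthy:
            pid = min(self.healthy, key=lambda p: (len(self.assigned[p]), p))
            self.assigned[pid].append(self.pending.popleft())

    def report_failure(self, pid):
        # Called by the monitoring layer: retire the processor and
        # requeue everything it was working on.
        if pid not in self.healthy:
            return
        self.healthy.discard(pid)
        self.pending.extend(self.assigned.pop(pid))
        self._dispatch()


pool = ProcessorPool(range(4))
for task in range(8):
    pool.submit(task)
pool.report_failure(2)  # processor 2 dies; its two tasks move elsewhere
```

No technician is involved and nothing waits: after the failure, all eight tasks are still assigned, just to three processors instead of four.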
The ultimate goal of autonomic functioning is to be able to tell computers what you want them to do and have them work out the details. In other words, to create a world where strategy and implementation are inseparable.
As arcane as that kind of semi-intelligent automation may seem, we take it for granted in our lives. An air-conditioning thermostat is autonomic in the sense that a person sets the desired temperature, and the electronics of the air conditioner maintain that temperature. The automatic transmission on a car interprets the instructions of the driver’s foot autonomically, compared with the operation of a manual transmission. On a much more sophisticated level, the telephone system — a vast array of interconnected equipment, networks, switches, and service providers — functions autonomically. It is both self-healing and virtually faultless. How often do you reboot your phone?
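The thermostat is small enough to sketch in a few lines of Python. This is a simplified illustration, assuming a cooling-only unit and a half-degree tolerance band; the function names and the numbers are made up for the example. The person supplies only the goal (the target temperature), and the loop decides on its own when to run.

```python
def cooling_should_run(current, target, running, band=0.5):
    """Decide whether the air conditioner should run.

    Turns on above target + band, off below target - band, and otherwise
    keeps doing whatever it was doing, so it doesn't flip on and off rapidly.
    """
    if current > target + band:
        return True
    if current < target - band:
        return False
    return running


def simulate(readings, target):
    """Feed a series of temperature readings through the control rule."""
    running = False
    states = []
    for temperature in readings:
        running = cooling_should_run(temperature, target, running)
        states.append(running)
    return states


states = simulate([23.0, 24.8, 24.2, 23.4, 24.1], target=24.0)
```

The strategy is one number, 24 degrees. Everything else is implementation, and none of it needs a person.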
Combining Thinking and Doing
How might autonomic computing work in practice? Take the computers of a financial-services company. The company is constantly receiving transaction requests from its own brokers and directly from customers. The company’s computers also need to provide routine information to employees and customers: account balances, transaction updates, research information. And the computers need to tend to all kinds of back-office chores: keeping the company’s financial records, doing payroll, providing research information and email services.
Donna Dillenberger, a senior technical researcher who is developing some of IBM’s new autonomic software with financial-services companies, teases out what happens when you add autonomic ability to such a company’s computers. “Say you have three users trying to access the computers at the same time,” says Dillenberger. “One is a premium user. You want that customer to get a quick response time — under 3 seconds — every time. The second user isn’t a premium user, but you still want no more than a 10-second response time. Then there’s the user who has no limits on the response time. That person can wait.”
In the real world, the premium user might be a regular trading customer at a large brokerage company. The second user might be someone with an account, but not a customer who generates much revenue. The “no limits” user might be someone who has just come to the site to get free stock quotes or to access free stock-research information. In order for the system to work, it first has to have the capability to recognize different kinds of users and follow different rules for them.
“If all three users hit the computer at the same time,” says Dillenberger, “the first thing the computer does is figure out that it can’t handle all three requests simultaneously. It needs to make sure that the premium customer is satisfied first.” The computer may also be doing other things. At the instant of the three requests, there may not even be enough capacity to satisfy the premium customer.
“If there isn’t enough capacity, the computer asks, Why not? What’s the bottleneck? Am I out of processing power? Am I out of memory?” Dillenberger explains. “The computer picks the most likely source of the bottleneck, predicts what it needs to meet the goal, finds a place to get that resource, and satisfies the premium customer.”
Of course, all of the analysis and problem solving has to happen in the blink of an eye. Then the computer has to move on to satisfy the pretty-good customer and then to satisfy the lowest-ranking customer. And this is the simplest kind of computerized automation and problem solving, Dillenberger points out. In reality, the computers at a financial-services company wouldn’t be handling three requests at once — more like 3,000 or 30,000. With the Internet and with vast networks of computers and computerized equipment, the information and problem-solving challenges go up exponentially.
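Dillenberger's three-user example can be sketched in Python. This is a hedged illustration rather than IBM's workload-management software: the 3-second and 10-second goals come from her description, but the `Request` type, the tier names, the utilization figures, and the "busiest resource is the likeliest bottleneck" rule are assumptions made for the example.

```python
import math
from dataclasses import dataclass

# Response-time goals in seconds, from the example in the text.
# The tier names themselves are assumptions.
GOALS = {"premium": 3.0, "standard": 10.0, "free": math.inf}


@dataclass
class Request:
    user: str
    tier: str


def order_by_goal(requests):
    """Serve the tightest response-time goal first; ties keep arrival order."""
    return sorted(requests, key=lambda r: GOALS[r.tier])


def likely_bottleneck(utilization):
    """Guess the bottleneck: the resource closest to being used up."""
    return max(utilization, key=utilization.get)


arrivals = [
    Request("quote_browser", "free"),
    Request("account_holder", "standard"),
    Request("active_trader", "premium"),
]
queue = order_by_goal(arrivals)
bottleneck = likely_bottleneck({"processing": 0.62, "memory": 0.97})
```

Here the active trader jumps to the front of the line, and the system's first guess about what is holding it back is memory. A real system would go on to predict how much more of that resource it needs and where to get it.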
Pratap Pattnaik is head of the scalable-systems group inside IBM research. His group has recently developed a server with memory that automatically allocates the fastest memory chunks to the most important work. “Even in most high-end servers,” says Pattnaik, “we have to do this manually.” The point is for computers to become goal oriented. “If someone accidentally cuts a cable,” says Pattnaik, “what does ‘self-healing’ mean? Obviously, the computer doesn’t go to the factory and make a new cable. Its goal is to get you to the Web page that you clicked. The computer has to know the goal — Get to that Web server! — and route you through pathways to get there.”
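Pattnaik's cut cable makes a tidy example. The sketch below, in Python, is an assumption-laden toy: the network, the node names, and the choice of a breadth-first search are all invented for illustration and are not how any particular IBM system routes traffic. What it shows is the goal-oriented part: the code is told where to get to, not which cable to use.

```python
from collections import deque


def find_route(links, start, goal):
    """Breadth-first search over two-way links; returns a path or None."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no pathway left at all


# Hypothetical network: a short route and a longer backup route.
links = {
    ("you", "router_a"), ("router_a", "web_server"),
    ("you", "router_b"), ("router_b", "router_c"), ("router_c", "web_server"),
}
normal_route = find_route(links, "you", "web_server")

# Someone cuts the cable between router_a and the Web server.
after_cut = links - {("router_a", "web_server")}
healed_route = find_route(after_cut, "you", "web_server")
```

Before the cut, the request goes through `router_a`. After it, the same goal produces a longer route through `router_b` and `router_c`, and the person who clicked never has to know.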
Why has IBM turned autonomic computing into a virtual crusade? It has no choice: IBM alone can’t fix the problems. “We all live in an ecosystem,” says Pattnaik, “even at IBM. Our customers may have IBM equipment, Cisco routers, Sun servers. I may have a really fast Ferrari, but I have to live within the ecosystem of the roads. If there’s a traffic jam, it doesn’t do me any good to have a Ferrari.”
But tiering customer service and allocating computer memory are easy compared with using computers as strategic tools. Rich Friedrich, director of one of Hewlett-Packard’s research labs, says that HP imagines a future where managers simply tell the computer, “Maximize revenue on this product line.” “That’s the kind of input we’re thinking about,” says Friedrich. “Not, ‘Configure this server with these parameters.’”
Getting Beyond SMASH
Language can often be a leading indicator of the state of the art. And the language of automated computing is unsettled, as is the field itself. No single phrase has yet arisen to dominate the handful of concepts that everyone agrees on: that computers need to be more aware of their own capacities and functions, on the lookout for failures or hiccups of all kinds, able to gauge their environment and the kind of work that they are being asked to do, able to adjust themselves to maximize performance or meet specific goals, and able to work around those problems without human help.
Armando Fox, an assistant professor of computer science at Stanford University, is working with David Patterson, a colleague at the University of California, Berkeley. They label their work “ROC”: recovery-oriented computing. It’s a phrase that Fox credits to Patterson. ROC (pronounced “rock”) is more tightly focused than IBM’s ideas. Says Fox: “Recovery-oriented computing means that failures happen. You can’t control that. And today’s software is not realistic. The question is, If recovery is the goal, how does that change the game?”
John Kubiatowicz, an assistant professor of computer science at UC Berkeley, has also been working on autonomic concepts for several years. He uses a phrase of his own creation: introspective computing. “Introspective computing is about using continuous monitoring, analysis, and feedback to adapt the system, to tune performance, and to make things more stable,” says Kubiatowicz. He has heard of ROC, of course. “I would say that’s more of a specific than an introspective computing,” he says, “but it’s an extremely important part of it.”
Introspective computing conjures up an image of a computer that might have to take time out for yoga in order to figure out how to solve a problem. Introspective computing is also a bit too meditative; most people want decisive computing.
IBM’s original phrase behind SMASH — simple, many, self-healing — has an appealingly democratic and holistic feeling. But as IBM has increased its focus on the ideas, it has moved beyond the acronym “SMASH” in favor of the phrase “autonomic computing.” “The more we thought about it,” says Pulleyblank, “smash is what you do to an egg. The idea of an autonomic system is a broad concept with a meaning of its own.”
The unspoken rivalry over terminology, though, masks a much wider, even surprising, consensus on the goals of autonomic computing. This is the moment when computer science shifts from focusing relentlessly on performance — How fast is the processor? How big is the hard drive? — to focusing on stability and reliability. The way IBM’s research organization was plunged into autonomic computing is indicative of both why and how the world of computer science is changing.
Paul Horn, a physicist by training, came to IBM from the University of Chicago two decades ago. After Horn took over as head of research for IBM in 1996, he did an assessment of how the R&D group was supporting the rest of IBM. By the late 1990s, about half of IBM’s revenue, profit, and employees were in the company’s global-services division, providing computing services and support to other companies, not simply selling them hardware and software. Of the R&D group’s $5 billion budget, Horn asked, how much is devoted to supporting global services? “Zippo,” says Horn. The scientist had a distinctly businessmanlike moment: “I said, ‘If we can’t provide anything to support global services (half the company, more than half the employees), maybe we ought to be half the size we are.’”
It’s remarkable how quickly you get people to focus when you suggest cutting the budget in half. “For us, it was Business Survival 101,” says Horn. He also quickly discovered that throughout R&D there were people like Donna Dillenberger who were working on projects relevant to global services.
Horn also discovered that global services had problems that R&D might be able to tackle. Customers think that information technology is too expensive to manage, too cumbersome, and too flaky, especially given how important it’s become. And those things are as true for IBM’s own global-services group as they are for IBM’s more traditional customers.
“The question is, How do you provide services cheaply?” says Horn. “How do you grow the business without exploding the number of people? What we really need are systems to take the people out. We were being hit by the very complexity that we created and the difficulty in managing that complexity. What we needed were systems to manage it.”
The results — even in early testing of IBM products not due until September — have been impressive. At one company, says Dillenberger, autonomic workload-management software is handling chores previously done by systems administrators. The server room has just one person in it. “It used to have 20 people,” she says. “And the computers are working consistently better than they used to. The room is quiet. And those 19 people have moved on to manage other things.”
One of IBM’s early products is what Dillenberger calls a “visualization tool,” a program that analyzes data and illustrates how systems are performing. “We were working with a telecom company,” she says, “and while we were there, they had a problem with the remote handhelds used by their field-service employees.” The field technicians got their work orders sent remotely to their handhelds. For three days, that wireless link functioned improperly; techs weren’t getting their assignments.
“They were losing money, their customers weren’t getting service, and it escalated to a critical situation,” says Dillenberger. “They had people from four different IT departments — network, workstations, applications, and remote systems — working on it.” But they weren’t making much headway. Dillenberger’s group used the new software to tackle the problem. “With the way it discards irrelevant data, it was able to pinpoint the problem in seconds, instead of in days,” she says. The reaction of the systems people at the telecom company? “They want it right away,” says Dillenberger. “They don’t want to wait until September.”
In other words, they want to eliminate the gap between strategy and implementation.
Charles Fishman is a senior editor based in Raleigh, North Carolina. Learn more about SMASH and autonomic computing on the Web (www.research.ibm.com/autonomic).