When David Leip first road-tested an early version of the Internet as a computer-science grad student at the University of Guelph, he was intrigued but skeptical. He even remembers thinking to himself, “Gee, it’s a pity that nothing will ever come of this cool technology.” That was 10 years ago, and the entire Web, which was then text-based, could be surfed in a couple of hours.
Today, as corporate Webmaster for IBM, Leip, 34, runs one of the world’s largest Web sites. Last year, IBM.com generated $9.2 billion, or more than 10% of IBM’s $88.4 billion revenue. The site, which has grown to 4.5 million pages and is available in 31 languages in 59 countries, is a powerhouse version of that “cool technology” Leip glimpsed years ago.
Just last Monday, traffic to the site hit a record 861,000 home-page views, thanks to promotions surrounding the 20th anniversary of the PC. And yet, despite the heavy traffic, so far this year, IBM.com has maintained a nearly flawless Web presence — 99.998% uptime to be exact. The site was down for a total of nine minutes last April due to what Leip calls a “silly human error.” (Web sites that achieve “best-in-class” levels of service availability offer less than 9 hours of unplanned downtime and less than 12 hours of planned downtime a year, according to the Gartner Group. The company estimates that fewer than 5% of mission-critical services today reach that level of service.)
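To put those percentages in perspective, here is a minimal Python sketch of the arithmetic behind an uptime figure. The function names are illustrative, and the 365-day measurement window is an assumption; the article does not say what period IBM measures against.

```python
MINUTES_PER_DAY = 24 * 60


def availability_pct(downtime_minutes: float, period_days: float) -> float:
    """Percentage of the period during which the site was up."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


def downtime_budget_minutes(target_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed in the period at a target availability."""
    total_minutes = period_days * MINUTES_PER_DAY
    return total_minutes * (1.0 - target_pct / 100.0)


# Assumption: a 365-day window. Nine minutes of downtime in that window
# works out to roughly 99.998% availability.
nine_minute_outage = availability_pct(9, 365)

# Gartner's unplanned-downtime ceiling of 9 hours a year is roughly 99.9%.
gartner_unplanned = availability_pct(9 * 60, 365)

# A 99.998% target leaves about ten and a half minutes of downtime a year.
yearly_budget = downtime_budget_minutes(99.998, 365)
```

Seen this way, a single nine-minute outage uses up most of a year's allowance at 99.998%, while the "best-in-class" ceiling of nine unplanned hours is about 60 times more forgiving.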
But Leip will be the first to tell you that the Web wasn’t always such a hit. When Leip joined IBM as a software developer in the Toronto office in 1992, the Web was barely on IBM’s radar screen. It fell to Leip and a small team of Internet evangelists to launch a grassroots effort to spread the Web gospel throughout the company.
Now, nine years later, Leip manages 22 Webmasters worldwide. And while he acknowledges that the position requires superb technical skills, he also believes that having a sense of humor is critical — if for no other reason than to keep him and his team sane in the face of unrelenting performance pressure.
“It’s a high-profile job within the company,” Leip says. “That can be a good thing and a bad thing, because you take a fair amount of heat when there’s a problem.” Sure, the first-ever outage to the IBM home page was stressful, he says, but that was nothing compared to the time Lou Gerstner, IBM’s CEO, contacted him to point out a glitch that he had discovered while surfing the site on his new BlackBerry. “In a big company like IBM,” says Leip, “a call from the CEO isn’t a common occurrence.”
Leip’s concern with site performance can be, well, a 24-7 obsession. “Sometimes, I have these horrible dreams that the site has been down for hours at a time. They’re the equivalent of that childhood nightmare when you go to school in your pajamas,” he says, laughing. “Then you wake up, look at your clock, and breathe a sigh of relief that it was only a dream.”
“Continuous availability is not just a nice thing to have,” he says. “We have customers who expect to be able to get to the site any time of the day. What makes the Web generally wonderful is that you aren’t limited to calling an 800-number between 9 AM and 5 PM.”
For Leip and his crack team of Webmasters, running a world-class site starts with a relentless commitment to keeping the site up and running — no matter what it takes. “Moving traffic to IBM.com is more important than ever,” Leip says. “If it can be done via the Web, most customers want to do it via the Web.”
Here are Leip’s tips for keeping a busy site up and running 99.998% of the time.
Build It to Keep Going and Going and Going
From the beginning, you have to build your system architecture with a high level of availability in mind. The system should be designed to run in a distributed environment, where you have multiple servers or processors within a location. That avoids having a single point that can fail. It also allows you to keep the site running while you deal with unplanned hardware failures and planned outages or maintenance. The customer isn’t disturbed by changes to the site, because the architecture is engineered to accommodate those changes.
Even more reliable is a distributed environment across multiple locations, which are mirrored in real time. That ensures that the site will continue to run, even if a catastrophe strikes. If a location shuts down because of a natural disaster, such as a fire or an earthquake, IBM.com continues to serve the world.
Multiple locations also mean better performance — faster load times — on average. IBM.com users are served from the location closest to them. If the infrastructure nearest you goes offline for some reason, you’re automatically routed to the next-closest location. IBM.com has three server facilities.
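The routing idea described here, serve each user from the closest location and fall back to the next-closest if it goes offline, can be sketched in a few lines of Python. This is an illustration only, not IBM's routing logic: the `Location` type, the `pick_location` function, and the distance and health fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Location:
    name: str           # hypothetical facility label
    distance_km: float  # assumed distance from the user to the facility
    healthy: bool       # assumed result of a recent health check


def pick_location(locations: Sequence[Location]) -> Optional[Location]:
    """Return the closest healthy location, or None if every one is down."""
    candidates = [loc for loc in locations if loc.healthy]
    return min(candidates, key=lambda loc: loc.distance_km, default=None)
```

With three facilities, a user normally lands on the nearest one. Mark that one unhealthy and the same call returns the next-closest, which is the automatic rerouting the paragraph above describes.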
Assume You’ll Be Hacked, and Plan for Trouble
Hackers try to break into IBM.com several times a day. So we have to be very focused on security, locking down the system as much as possible. One way we do that is to make sure that we’re not running any unnecessary services and that the services we do run always have minimal privileges on the system. If, for example, hackers compromised a Web server and could gain root-level privileges (a high degree of access), then they could do anything on the site.
If they had limited privileges, then very little damage could be done. We try to limit any “super user” permission levels, so there are fewer vulnerabilities. Whenever possible, we also run our file systems as read-only file systems, so if people break in, they would have a really tough time changing data. Web-site defacements in general are on the rise, so we never can be too careful. But with any kind of software, there are always going to be vulnerabilities. So we have a whole set of procedures for judging the severity of a potential threat and how quickly to roll out a solution.
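The two lockdown rules above, run no unnecessary services and give the rest minimal privileges, lend themselves to an automated check. The Python sketch below is a hypothetical audit, not a description of IBM's tooling: the `Service` type, the `audit_services` function, and the allowlist are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Service:
    name: str  # hypothetical service name, e.g. "httpd"
    user: str  # account the process runs as


def audit_services(running: List[Service], allowed: Set[str]) -> List[str]:
    """Flag services that are unnecessary or that run with root privileges."""
    findings = []
    for svc in running:
        if svc.name not in allowed:
            # Rule 1: anything not explicitly needed should not be running.
            findings.append(f"{svc.name}: not on the allowlist, should be disabled")
        elif svc.user == "root":
            # Rule 2: needed services should still run with minimal privileges.
            findings.append(f"{svc.name}: runs as root, should use an unprivileged account")
    return findings
```

An empty result means both rules hold for the services that were inspected. It says nothing about vulnerabilities in the software itself.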
Build a Culture That Obsesses About Uptime All the Time
For my team and the people who work with us, availability is a big deal. It’s a source of pride for everyone. It’s not David Leip’s Web site; it’s a team achievement. My job is to make sure that every single person feels as if he is a part of that team.
The outage earlier this year was disappointing, but I don’t want to overstate it. It was a nine-minute outage, and the site was already back online by the time my team tracked me down to tell me what happened.
We have a phenomenal record, but that doesn’t mean we’re going to stop going the extra mile. We’re after 100% uptime.
Monitor the Site Like a “Paranoid System-Architecture Guy”
We’ve got to keep an eye on exactly what is going on at any given moment on the system. Monitoring allows us to detect outages as they occur, ideally before users notice them, rather than responding only after customers have been affected. We look for things like lost redundancy in the system. For example, if we have two network routers and one fails, the customer probably won’t notice because the other router will continue to work. But if nobody notices that the first router failed, then the problem persists, and the site is one more failure away from an outage. We constantly monitor not only the overall availability of the site but also how the different pieces in the environment work together. At the same time, we regularly watch traffic levels. If we know ahead of time that we’re going to have a surge in traffic, then we can add capacity to the system before the heavy load strains the system and creates an availability problem.
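The router example points at a state that a simple "is the site up?" check will miss: the site is up, but its redundancy is gone. A minimal Python sketch of that distinction follows. The function names, status labels, and component names are hypothetical, not part of any real monitoring product.

```python
from typing import Dict, List


def group_status(members: Dict[str, bool]) -> str:
    """Classify a redundant group from the up/down state of its members."""
    if not members:
        raise ValueError("a redundant group needs at least one member")
    up_count = sum(1 for is_up in members.values() if is_up)
    if up_count == len(members):
        return "ok"
    if up_count == 0:
        return "down"      # customers are affected
    return "degraded"      # still serving, but redundancy is lost


def alerts(groups: Dict[str, Dict[str, bool]]) -> List[str]:
    """List every group that needs attention, including silent failures."""
    messages = []
    for group_name, members in groups.items():
        status = group_status(members)
        if status != "ok":
            failed = sorted(name for name, is_up in members.items() if not is_up)
            messages.append(f"{group_name} is {status}: {', '.join(failed)} failed")
    return messages
```

A "degraded" alert is the one that matters in the two-router case: customers see nothing wrong, yet someone still needs to replace the failed router before the second one goes.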
Nothing is fail-proof. We’re dealing with varying degrees of risk and probability. And we can always do more to reduce that risk. Undoubtedly, we will discover new vulnerabilities in the next few months, which can be nerve-racking because we know that IBM and the rest of the world have been vulnerable for quite some time. Does that worry me? Absolutely. At the end of the day, you have to have faith in your systems and procedures, but you never can feel too safe. When you feel safe, you fall asleep at the wheel.
Christine Canabou is a Fast Company staff writer.