A few years ago, I found myself in a room full of electrical engineers discussing the smart grid. One thing was strikingly absent at this event: screens. The vast majority of these engineers were taking notes on paper. It soon became clear that this was a very different world from that of Internet startups, where speed, scale, and the shock of the new were all that mattered. The men, and they are mainly men, who run the world’s power plants and electricity grids care about something else entirely: the safety and reliability of supply.
To secure those systems you need a lot more than strong passwords. Networked industrial installations like power plants may be threatened by cyber attacks, natural disasters, or simple machine failures. "I don’t really care if the reason that my generator is about to fail is because of a virus or Stuxnet (an attack first unleashed on an Iranian nuclear facility) or something else," says Amir Husain, the CEO of Sparkcognition.
Those security threats need to be detected in a new way. That's where machine learning, algorithms that learn from data, comes in.
Losing power on a large scale is more than an inconvenience. The world's biggest blackout was suffered by India in 2012, when 620 million people lost power. Miners were trapped underground. Delhi's metro shut down and had to be evacuated. Hospitals had to resort to generators. If Facebook goes down for a day, who really suffers apart from advertisers? If the grid fails, people die and businesses lose billions. Without electricity, nothing else works.
"You don’t buy a $50 or $100 million asset (like the massive turbines which generate energy in a power plant) and be dying for the version that comes out in three months to upgrade it," says Brian Courtney, product line leader at GE Critical Power. "Forty years ago, you buy a part, it’s still in operation today. The pace of change is much much slower."
Welcome to the world of the Industrial Internet.
How does data collection, then, work for power grids? It starts with turbines: 30% of the world’s power is generated using GE's turbines, according to Courtney. Those turbines, and many other forms of industrial equipment, are instrumented with sensors. "Every time you come up with a new turbine, it has twice the sensors the last turbine did," says Courtney. "In the industrial segment there’s between 20 and 30 billion devices that are connected and they think that will get to 75 billion by 2020. Industrial data intelligence is really, 'How do we get meaning from the information that comes from equipment so we can make that equipment work more effectively and efficiently, make it safer, make it more reliable?'"
Husain calls this big analog data. Mechanical measurements like vibration, temperature, and pressure are analog: continuous signals rather than discrete values. And companies like Sparkcognition have sprung up to help secure grids with all of this information.
In particular, Sparkcognition uses that data and machine learning to detect threats to physical infrastructure and automatically defend it. One of the tools it uses is IBM’s mastermind AI system Watson.
This mode of working represents a sea change in cybersecurity operations. Traditional software security is signature-based. A human expert reverse-engineers a piece of malware like a virus and derives a signature for it. "If this file is on your system, it’s a bad file and here is what it looks like," says Husain. That signature is then distributed so the attack can be recognized. That model no longer works when there are a huge number of different attacks. "There are anywhere between 25,000 and 40,000 zero-day attacks happening every day," says Husain. A zero-day attack has no known signature.
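The signature model can be sketched in a few lines of Python. This is a toy illustration (hypothetical file contents, a one-entry database), not any vendor's actual scanner: a file is flagged only if its hash already appears in a database of known-bad signatures, which is exactly why a zero-day, with no signature yet, slips through.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Toy signature database: hashes of files a human analyst has already
# reverse-engineered and labeled as malicious.
known_bad = {sha256_hex(b"dropper-v1 payload")}

def is_known_malware(file_bytes: bytes) -> bool:
    """Signature matching: flag the file only if its hash is in the database."""
    return sha256_hex(file_bytes) in known_bad

print(is_known_malware(b"dropper-v1 payload"))  # True  -- a signature exists
print(is_known_malware(b"zero-day payload"))    # False -- no signature yet
```

The second call is the whole problem: until an analyst produces a signature, the scanner has nothing to match against.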
Software security is also mostly in-band; it pairs the asset being protected with a defense, e.g. the firewall in front of your computer network, which intercepts attacks. "By doing that you also expose the means of defense," says Husain. "What we talk about is out of band security. We know that it is possible to simulate physical devices by looking at the data which is now coming from those physical devices. Instead of trying to create a better firewall, instead of trying to create a better anti-virus program, we just look at data and from that data create a simulation. Now we know the expected behavior of a system and when that system deviates from that expected behavior either due to a threat or due to a natural failure." This also means that the simulation, or model, can constantly be adjusted based on new data. The model learns.
Machines may be able to handle large swathes of data with ease but learning a model of a complex machine or an entire industrial installation, with all its possible causes of failure, is still not straightforward. "Even with something as simple as a motor with a fan there can be failures of multiple types," says Husain. "A bearing can fail. You can have a voltage problem which causes current to not properly flow. You can have an overload on the shaft. You can have a perturbation or something which is coming in the way of a fan."
Sparkcognition’s system first gathers huge amounts of information about the system being modeled: structured data in the form of sensor, health, and maintenance measurements and semi-structured data from machine and application logs. The software then spins up multiple neural networks, learning algorithms inspired by the networks of neurons in the human brain and how they recognize patterns in data. A competition is organized between the various neural networks using concepts from genetic algorithms. A genetic algorithm takes an initial population of solutions and evolves better ones over multiple "generations" by cross-pollinating and mutating those solutions.
The data provides an initial set of features, or properties of the system being modeled. One thing the competing neural networks try to determine is which properties matter most for detecting a particular threat or failure. Each neural network learns from a different subset of features, including higher-order features, which are random linear combinations of the basic or first-order features. "Some will become masters of the art of figuring out when there is some sort of protrusion that is impacting the fan," says Husain. "Others will become masters in the art of picking up things like arcing (discharge of electric current across a gap) when a connection is loose."
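The feature setup described above can be made concrete with a short sketch (the sensor names and values are made up for illustration; this is not the vendor's pipeline): first-order features are raw sensor readings, higher-order features are random linear combinations of them, and each competing model draws its own random subset.

```python
import random

random.seed(1)

# First-order features: raw sensor readings (hypothetical values).
first_order = {"vibration": 0.82, "temperature": 61.5, "pressure": 3.4}

def higher_order_feature(readings, n_inputs=2):
    """A higher-order feature: a random linear combination of basic features."""
    names = random.sample(list(readings), n_inputs)
    weights = [random.uniform(-1, 1) for _ in names]
    value = sum(w * readings[n] for w, n in zip(weights, names))
    return names, weights, value

def feature_subset_for_model(readings, k=2):
    """Each competing model sees a different random subset of the features."""
    return dict(random.sample(sorted(readings.items()), k))

names, weights, value = higher_order_feature(first_order)
print(names, round(value, 3))
print(feature_subset_for_model(first_order))
```

A network trained only on, say, vibration-derived features can specialize in vibration faults, which is how the "domain expert" networks Husain describes emerge.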
The networks learn in an unsupervised way, meaning that they don't know what they are trying to learn; they just try to find structure in the data. The competing networks certainly don't know what arcing or vibration analysis is. "Yet neural networks are evolving into those domain experts," says Husain.
Most physical systems have alert thresholds for particular measurements like pressure or vibration. Exceeding them indicates trouble. "So as we find a neural network making a projection that starts to tend towards a violation of any one of those thresholds, then we know that the behavior and the features that neural network is looking at are the ones that are causing the machine to go out of whack," Husain explains.
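That projection-against-threshold check might look like the following (a simple least-squares extrapolation over made-up vibration readings, not the production system): fit a trend to recent data, project it forward, and flag when the projection tends toward a threshold violation.

```python
def linear_projection(readings, steps_ahead):
    """Least-squares slope/intercept, extrapolated steps_ahead points forward."""
    n = len(readings)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(readings) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, readings))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return slope * (n - 1 + steps_ahead) + intercept

vibration = [1.0, 1.1, 1.3, 1.4, 1.6, 1.7]  # mm/s, trending upward
ALERT_THRESHOLD = 2.5                        # hypothetical alert level

projected = linear_projection(vibration, steps_ahead=6)
print(round(projected, 2))          # 2.59
print(projected > ALERT_THRESHOLD)  # True -- flag before the machine fails
```

The value of projecting, rather than just thresholding the latest reading, is lead time: the alert fires while every individual measurement is still within limits.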
Once a model built from the best neural networks finds an anomaly in the sensor or other data, a second opinion is sought from IBM's Watson. Machine learning systems often generate many false positives, which in this case means flagging a problem when there isn’t one. Industrial equipment is expensive to shut down. "The whole game now is how do you get to no unplanned downtime. How do I fix things at the minimum possible cost?" says Courtney. Neural networks are also black boxes: they are often very accurate in their predictions, but they can’t tell you why they are predicting something.
"Watson is our in-context advisory service," says Husain. "Watson is the system that we train on the manuals, the existing industry literature pertaining to all of the assets that we monitor." While the neural networks are trained on the kind of structured and semi-structured data more easily understood by machines, Watson can learn from research papers, manuals, and other literature written by humans, what Machine Learning researchers call natural language.
Sparkcognition sends natural language queries to Watson. Watson reads all the relevant literature and returns an answer, including how confident it is in that answer. It also presents evidence for its conclusion in natural language. Watson can often even tell Sparkcognition how to fix the problem detected, based on the technical literature.
Sparkcognition now works with some of the largest energy companies in the U.S., one of which monitors a considerable number of hydraulic pumps. Leaks in those pumps lead to failures which are usually detected when a maintenance crew physically spots the leak. But that can mean a delay of up to a week before a leak is noticed. "So we were given data from these pumps," says Husain. "And the company said, ‘Look, we were aware of one failure that happened in this data — and this data spanned many months — and our human data scientists can find this failure. Can your algorithms find it?’"
Sparkcognition’s staff ran their model and were able to determine exactly when the leak occurred (the company itself only knew the general time window) and why.
But there was something else in the data: a second failure. "They had simply not caught it," says Husain. "Data can sometimes tell you more than even human observation can."