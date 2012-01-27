

This year will be the year of Big Data. The Data Warehousing

Institute (TDWI) reported that 90 percent of the IT professionals it surveyed said they were familiar with big data analytics. And 34 percent said they

already applied analytics to Big Data. The vast hoards of data collected

during e-commerce transactions, from loyalty programs, employment records,

supply chain and ERP systems are, or are about to get, cozy. Uncomfortably

cozy. Let me start by saying there is nothing inherently wrong

with Big Data. Big Data is a thing, and like anything, it can be used for good

or for evil. It can be used appropriately given known limitations, or stretched

wantonly until its principles fray. For now, the identification, consolidation, and governance of data is an appropriate step, as Forbes’s Tom Groenfeldt recently

documented

with Michigan’s $19 million in data center consolidation savings. Dirk Helbing of the Swiss Federal Institute

of Technology in Zurich is more ambitious. His €1-billion

project, the topic of the December 2011 Scientific

American cover story, seeks to do nothing less than foretell the future. The meaningful use of Big Data lies somewhere between these

two extremes. For Big Data to move from being nothing more than an instantiation of

databases running in logical or physical proximity, to data that can be

meaningfully mined for insight, requires new skills, new perspectives, and new

cautions.

The Big Data Dream

Dirk Helbing seeks a system that is akin to Asimov’s

Psychohistory as imagined in the Foundation

series. In broad swaths, it would anticipate the future by linking social, scientific, and economic data. This system could be used to help advise world governments on the most

salient choices to make. Reading the article in Scientific

American reminded me of a science fiction story by

Tribble-inventor David Gerrold—When

Harlie Was One. In this book, Harlie, which stands for Human Analog Robot

Life Input Equivalents, decides that he needs answers, and that he isn’t

sophisticated enough to solve his own problem and therefore keep the corporate

interests that built him interested enough to keep him plugged in. So he

designs a new computer, the Graphic Omniscient Device, or GOD, as a proof of his

value. GOD will answer all questions submitted to it. Unfortunately, as the

human engineers building GOD eventually realize, the processing capacity is so

vast that GOD will not be able to provide an answer to any question during the

lifetime of a human. Harlie, of course, knew this all along. He needed the

humans for three reasons: to keep him running, for engineering labor to build GOD, and to ask the questions that GOD will answer.

Given the woes of Europe, spending €1-billion on such a project will likely prove to be wasted money. We, of course, don’t have a mechanical futurist to evaluate that

position, but we do have history. Whenever there is an existential problem

facing the world, charlatans appear to dazzle the masses with feats of magic

and wonder. I don’t see this proposal being anything more than the latest

version of apocalyptic sorcery. It’s not that a big science project can’t yield

interesting outcomes, but if you look at the Microelectronics and Computer Technology

Corporation (MCC), late of Austin, Texas, we find Cyc, a system conceived at

the beginning of the computer era, to combat Japan’s Fifth Generation Project as

it supposedly threatened to out-innovate America’s nascent lead in computer

technology. Although Cyc has yielded some use, it has not yet become the

artificial human mind it was intended to be, able to converse naturally with

anyone about the events, concepts, and objects in the world. And artificial

intelligence, as imagined in the 1980s, has yet to transform the human

condition. As Big Data becomes the next great savior of business and

humanity, we need to remain skeptical of its promises as well as its applications

and aspirations.

Existential Issues With Big Data

Determinism teaches that what will be, will be. Existentialism

deals with a humanity in the throes of chaos. Big Data can be seen as either a

lens through which determinism is revealed, or a tool for navigating an

existential world. As a scenario planner, I take the existential position and see

a number of existential threats to the success of Big Data and its applications.

Overconfidence: Many

managers creating a project plan, drawing up a budget, or managing a hedge fund trust their

forecasts based on personal abilities and confidence in their knowledge and

experience. As University of Chicago professor Richard H. Thaler recently

pointed out in the New York Times (“The

Overconfidence Problem in Forecasting”), most managers are overconfident

and miscalibrated. In other words, they don’t recognize their own inability to

forecast the future, nor do they recognize the inherent volatility of markets.

Both of these traits portend big problems for Big Data as humans code their

assumptions about the world into algorithms: people don’t understand their personal

limitations, nor do they recognize whether a model is good or not.

When learning happens: Even in a field as seemingly physical and visceral as fossil hunting, Big

Data is playing a role. Geologic data has been fed into a model that helps

pinpoint good fossil-hunting fields. On the surface that appears a useful

discovery, but if you dig a bit deeper, you find a lesson for would-be Big Data

modelers. As technology and data sophistication increase, the underlying

assumptions in the model must change. Current data, derived from the analysis

of Landsat photos, can direct field workers toward a fairly large, but

promising area with multiple types of rock exposures. Eventually the team hopes

to increase their 15-meter resolution to 15-centimeter resolution by acquiring higher-resolution data. As they examine the new data, they will need to change their

analysis approach to recognize features not previously available (for more see “Artificial intelligence joins the fossil

hunt” in New Scientist). Learning will mean reinterpreting

the model.

On a more abstract level, recent work conducted by ETH

Zurich looked at 43,000 transnational companies seeking to understand the relationships

between those companies and their potential for influence. This analysis found that

1,318 companies were tightly connected, with an average of 20 connections,

representing about 60 percent of global revenues. Deeper analysis revealed a “super-entity”

of 147 firms that accounts for about 40 percent of the wealth in the network.

This type of analysis has been conducted before, but the Zurich team included

indirect ownership, which changed the outcome significantly (for more see “The network of

global corporate control” by Vitali, Glattfelder, and Battiston).

If organizations rely on Big Data to connect far-ranging

databases–well beyond corporate ownership or maps of certain geologies–who,

it must be asked, will understand enough of the model to challenge its

underlying assumptions, and re-craft those assumptions when the world, and the

data that reflects it, changes?

Complexity: I was

sitting with the CIO of a large insurance company in Portland. We were talking

about generational hand-offs when he raised the issue of an Excel spreadsheet

used to evaluate commercial property underwriting. He said one of the older

members of the organization owned that spreadsheet and he was the only one who knew how it worked. The hand-off issue was not one of getting the older

employee to collaborate with the younger employee, but one of complexity. That

spreadsheet was complex and tightly woven into the employee’s worldview.

Although the transfer could theoretically take place, it is unknowable how long

it would take, whether the new employee would stay, or how the process would change

as multiple worldviews collided. Combining models full of nuance and obscurity increases complexity. Organizations that plan complex uses of Big Data and

the algorithms that analyze the data need to think about continuity and

succession planning in order to maintain the accuracy and relevance of their

models over time, and they need to be very cautious about the time it will take

to integrate, and the value of results achieved, from data and models that

border on the cryptic.

Feedback Loops: Big Data isn’t just about the size of

well-understood data sets, it is about linking disparate data sets and then

creating connective tissue, either through design or inference, between these

data sets. At the onset of the Great Recession, we experienced a feedback loop

failure as David X. Li’s famous Gaussian copula function, a seemingly well-tested

approach to analyzing financial risk, failed to anticipate the risks lurking

outside of its models. People kept trading, assuming their risk analysis was

still meaningful. No feedback loop existed to inform the bond markets that

their credit default swaps were an inverse pyramid teetering on a bed of

miscalculation.

Algorithms and a Lack of Theory: It is not only algorithms that can go wrong when a theory proves

incorrect or the assumptions underlying the algorithm change. There are places

where no theory commands enough consensus to be meaningful. The impact

of education (and the effectiveness of various approaches), how innovation

works, or what triggers a fad are examples of behaviors for which little valid

theory exists–it’s not that plenty of opinion about various approaches or

models is lacking, but that a theory, in the scientific sense, is nonexistent.

For Big Data that means a number of things, first and foremost, that if you don’t

have a working theory, you probably don’t know what data you need to test any

hypotheses you may posit. It also means that data scientists can’t create

a model because no reliable underlying logic exists that can be encoded into a

model.

Confirmation Bias:

Every model is based on historical assumptions and perceptual biases.

Regardless of the sophistication of the science, we often create models that

help us see what we want to see, using data selected as a good indicator of

such a perception. Take a recent debate about how to price futures. Future

events are typically discounted using an exponential model that creates a

regular discount rate that eventually leads to a value near zero for far-flung

events. Exponential discounting takes a deterministic view. A more existential view

comes from the proponents of hyperbolic discounting, which creates a preference

for rewards that arrive sooner rather than later. With hyperbolic discounting,

discounts of future events fall more gradually, leading to what might be called

“irrational behavior.” Another version is called the declining discount rate.
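
To see how far apart the two camps can land, here is a minimal sketch in Python. It assumes the textbook forms of each curve, exp(-rate * t) for exponential discounting and 1 / (1 + k * t) for hyperbolic discounting. The function names and the 5 percent values for `rate` and `k` are illustrative assumptions, not figures drawn from the debate itself.

```python
import math


def exponential_discount(t, rate):
    """Discount factor exp(-rate * t): the same proportional decline every period."""
    return math.exp(-rate * t)


def hyperbolic_discount(t, k):
    """Discount factor 1 / (1 + k * t): steep early on, flatter for distant events."""
    return 1.0 / (1.0 + k * t)


def compare(years, rate=0.05, k=0.05):
    """Return (year, exponential factor, hyperbolic factor) for each horizon.

    The default rate and k are hypothetical values chosen only for illustration.
    """
    return [
        (t, exponential_discount(t, rate), hyperbolic_discount(t, k))
        for t in years
    ]


if __name__ == "__main__":
    for t, e, h in compare([1, 10, 50, 100, 200]):
        print(f"{t:>4} years  exponential={e:.4f}  hyperbolic={h:.4f}")
```

With these assumed parameters, a payoff 100 years out keeps less than 1 percent of its value under the exponential curve and roughly 17 percent under the hyperbolic one. Same data, same horizon, very different advice about whether the distant future is worth paying for today.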

This “discount” debate points out that even when a model

exists that is designed to aid in decision making about the future, that model

may involve contentious disagreements about its validity and alternative

approaches that yield very different results. These are important debates in

the world of Big Data. One group of modelers advocates for one approach, and

another group, an alternative approach, both using sophisticated data and black

boxes (as far as the uninitiated business person is concerned) to support their

cases. The fact is that in cases like this, no one knows the answer

definitively as the application may be contextual or it may be incomplete

(e.g., a new approach may solve the issue that none of the current approaches

solves completely). What can be said, and what must be remembered, is the adage

that “a futurist is never wrong today.” Who wins these debates today may be meaningless

because the implications have no near-term consequences, but companies that accept

one approach over the other may be betting their firm’s future on wishful

thinking and unwillingness to admit what they don’t know.

The World Changes: We

must remember that all data is historical. There is no data from or about the

future. Future context changes cannot be built into a model because they cannot

be anticipated. Consider this: 2012 is the 50th anniversary of the

1962 Seattle World’s Fair. In 1962, the retail world was dominated by Sears, Montgomery

Ward, Woolworth, A&P, and Kresge. Some of those companies no longer exist,

and others have merged to the point that they are unrecognizable from their

1962 incarnations. Also in 1962, thousands of miles away, a small company

opened its first location in Rogers, Ark.–the Wal-Mart Discount City.

Would models of retail supply chains built in 1962 be able to anticipate the

overwhelming disruption that this humble storefront would cause for retail? Could Sam Walton, who died in 1992, have anticipated the impact of Amazon.com, which went live in 1995?

The answer to all of the above is “no.” These innovations are

rare and hugely disruptive. They are far from doomsday scenarios except for

firms so entrenched in their models that they can’t adapt. And there is the

problem for Big Data. As with Li’s risk analysis algorithm, when the world

changed, the model did not. Feedback loops are important as a way of

maintaining relevance through incremental improvement, but what happens when the

world changes so much that current assumptions become irrelevant and the clock

must be started again? Not only must we remember that all data is historical,

but we must also remember that at some point historical data becomes irrelevant

when the context changes.

Motives: In a

recent BusinessWeek article (“Palantir, the

War on Terror’s Secret Weapon”), Peter Thiel is quoted as saying:

“We cannot afford to have another 9/11 event in the U.S. or anything bigger

than that. That day opened the doors to all sorts of crazy abuses and draconian

policies.” Thiel, characterized in the article as a libertarian, sees the

analysis of data as a positive for civil liberties. That position is debatable

as groups like NO2ID.net form to campaign

against not only the use of Big Data, but its creation. Given the complexity of

the data and associated models, along with various intended or unintended

biases, organizations have to go out of their way to discern the motives of

those developing analytics models, lest they allow programs to manipulate data

in a way that may precipitate negative social, legal, or fiduciary outcomes.

Acting on the Model: Consider crime analysis. George Mohler of

Santa Clara University in California has applied equations that predict earthquake

aftershocks to crime. By using the locations and times of recent crimes,

the system predicts “aftercrimes.” This kind of anticipatory data may result in battalions of police flooding a neighborhood following one burglary. With no

police presence, the anticipated crimes may well take place. If the burglars,

however, see an increase in surveillance and police activity, they may abandon

planned targets and seek new ones, thus invalidating the model’s predictions,

potentially in terms of time and location. The proponents of Big Data need to

ensure that the users of their models understand the intricacies of trend

analysis, what a trend really is, and the implications of acting on a model’s

recommendations.

Where Big Data Will Work

Some of the emerging Big Data stories don’t test the

existential limits of technology, nor do they threaten global catastrophe. The

worst outcome in dinosaur fossil hunting is not finding dinosaur bones where

you expect them, and the worst outcome of a crime prediction is a burglary that

doesn’t take place. Big Data will no doubt be used to target advertising, reduce

fraud, fight crime, find tax evaders, collect child support payments, create

better health outcomes, and myriad other activities from the mundane to the

ridiculous. And along the way, the software companies and those who invested in

Big Data will share their stories.

Companies like monumental constructor Arup use Big Data as

a way to better model the use of the buildings they build. The Arup software

arm, Oasys, recently acquired MassMotion to help them understand the flow of

people through buildings. MassMotion can model the intimacy of a coffee shop or

the flow of hundreds of thousands through a terminal or metro system using its

agent technology. Its models are informed by data about traffic patterns,

arrival and departure times for various forms of transportation and the

environment, such as shopping locations and information desks where people

might stop or congregate. The result is

a model, sometimes with thousands of avatars, pushing and shoving, congregating

and separating–all based on MassMotion’s Erin Morrow and how he perceives the

world.

Another movement-oriented application of Big Data, Jyotish

(Sanskrit for astrology), comes from Boeing’s research center at the University of

Illinois in Urbana-Champaign. This application predicts the movement of work

crews within Boeing’s factories. It will ultimately help them figure out how to

save costs and increase satisfaction by ensuring that services, like Wi-Fi, are

available where and when they are needed.

Palantir, the Palo Alto-based startup focused on solving

the intelligence problem of 9/11, discovers correlations in the data that inform

military and intelligence agencies of the who, what, and when as a potential threat turns

into an imminent threat. With access to data models and data across government

silos, Palantir may well make its hero case more often than not in individual

cases. They have to be cautious about applying their ideas to different domains

where underlying rules might not be so clear or data so well-defined.

For some fields, like biology, placing large data sets into

open source areas may bring a kind of convergence as collaboration ensues. But

as Michael Nielsen points out in Reinventing

Discovery, scientists have very little motivation to collaborate given the

nature of publication, reputation, and tenure.

I seriously doubt that we have the intellectual infrastructure to

support the collaborative capabilities of the Internet. We may well be able to

connect all sorts of data and run all kinds of analyses, but in the end, we may

not be equipped to apply the technology in a meaningful and safe way at scales

that outstrip our ability to represent, understand, and validate the models and

their data.

A Telling Story

In the Scientific

American article, Helbing relates an old story. A drunk man is looking for

his keys under a street lamp. When asked why, he responds, “that’s where the

light is.” Helbing, it appears, wants to use Big Data to create a

brighter light so that scientists can peer beneath it for insight. Scenario

planners tell that story too, but we do so to make the point that while one is

looking for keys in the light, they aren’t paying any attention to the

darkness. The future of Big Data lies not in the stories of anecdotal triumph

that report sophisticated, but limited accomplishments–no, the future of Big

Data rather lies in the darkness of context change, complexity, and overconfidence. I will end, as Thaler did in his New York Times article, by quoting Mark Twain: “It ain’t what you

don’t know that gets you into trouble. It’s what you know for sure that just

ain’t so.”