Why Big Data Won’t Make You Smart, Rich, Or Pretty

This year will be the year of Big Data. The Data Warehousing
Institute (TDWI) reported that 90 percent of the IT professionals it surveyed said they were familiar with big data analytics. And 34 percent said they
already applied analytics to Big Data. The vast hoards of data collected
during e-commerce transactions and from loyalty programs, employment records,
supply chains, and ERP systems are, or are about to get, cozy. Uncomfortably
cozy.

Let me start by saying there is nothing inherently wrong
with Big Data. Big Data is a thing, and like anything, it can be used for good
or for evil. It can be used appropriately given known limitations, or stretched
wantonly until its principles fray. For now, the identification, consolidation, and governance of data is an appropriate step, as Forbes’s Tom Groenfeldt recently
documented
with Michigan’s $19 million in data center consolidation savings. Dirk Helbing of the Swiss Federal Institute
of Technology in Zurich is more ambitious. His €1-billion
project, the topic of the December 2011 Scientific
American cover story, seeks to do nothing less than foretell the future.

The meaningful use of Big Data lies somewhere between these
two extremes. For Big Data to move from being nothing more than an instantiation of
databases running in logical or physical proximity to data that can be
meaningfully mined for insight requires new skills, new perspectives, and new
cautions.

The Big Data Dream

Dirk Helbing seeks a system that is akin to Asimov’s
Psychohistory as imagined in the Foundation
series. In broad swaths, it would anticipate the future by linking social, scientific, and economic data. This system could be used to help advise world governments on the most
salient choices to make.

Reading the article in Scientific
American reminded me of a science fiction story by
Tribble-inventor David Gerrold—When
Harlie Was One. In this book, Harlie, which stands for Human Analog Robot
Life Input Equivalents, decides that he needs answers, and that he isn’t
sophisticated enough to solve his own problem and therefore keep the corporate
interests that built him interested enough to keep him plugged in. So he
designs a new computer, the Graphic Omniscient Device, or GOD, as a proof of his
value. GOD will answer all questions submitted to it. Unfortunately, as the
human engineers building GOD eventually realize, the processing capacity is so
vast that GOD will not be able to provide an answer to any question during the
lifetime of a human. Harlie, of course, knew this all along. He needed the
humans for three reasons: to keep him running, for engineering labor to build GOD, and to ask the questions that GOD will answer.

Given the woes of Europe, spending €1-billion on such a project will likely prove to be wasted money. We, of course, don’t have a mechanical futurist to evaluate that
position, but we do have history. Whenever there is an existential problem
facing the world, charlatans appear to dazzle the masses with feats of magic
and wonder. I don’t see this proposal being anything more than the latest
version of apocalyptic sorcery. It’s not that a big science project can’t yield
interesting outcomes, but if we look at the Microelectronics and Computer Technology
Corporation (MCC), late of Austin, Texas, we find Cyc, a system conceived in
the mid-1980s to combat Japan’s Fifth Generation Project as
it supposedly threatened to out-innovate America’s nascent lead in computer
technology. Although Cyc has yielded some use, it has not yet become the
artificial human mind it was intended to be, able to converse naturally with
anyone about the events, concepts, and objects in the world. And artificial
intelligence, as imagined in the 1980s, has yet to transform the human
condition.

As Big Data becomes the next great savior of business and
humanity, we need to remain skeptical of its promises as well as its applications
and aspirations.

Existential Issues With Big Data

Determinism teaches that what will be, will be. Existentialism
deals with a humanity in the throes of chaos. Big Data can be seen as either a
lens through which determinism is revealed, or a tool for navigating an
existential world. As a scenario planner, I take the existential position and see
a number of existential threats to the success of Big Data and its applications.

Overconfidence: Many
managers creating a project plan, drawing up a budget, or managing a hedge fund trust their
forecasts based on personal abilities and confidence in their knowledge and
experience. As University of Chicago professor Richard H. Thaler recently
pointed out in the New York Times (“The
Overconfidence Problem in Forecasting”), most managers are overconfident
and miscalibrated. In other words, they don’t recognize their own inability to
forecast the future, nor do they recognize the inherent volatility of markets.
Both of these traits portend big problems for Big Data as humans code their
assumptions about the world into algorithms: people don’t understand their personal
limitations, nor do they recognize whether a model is good or not.

When learning happens: Even in a field as seemingly physical and visceral as fossil hunting, Big
Data is playing a role. Geologic data has been fed into a model that helps
pinpoint good fossil-hunting fields. On the surface that appears a useful
discovery, but if you dig a bit deeper, you find a lesson for would-be Big Data
modelers. As technology and data sophistication increase, the underlying
assumptions in the model must change. Current data, derived from the analysis
of Landsat photos, can direct field workers toward a fairly large, but
promising area with multiple types of rock exposures. Eventually the team hopes
to sharpen their 15-meter resolution to 15 centimeters by acquiring higher-resolution data. As they examine the new data, they will need to change their
analysis approach to recognize features not previously available (for more see “Artificial intelligence joins the fossil
hunt” in New Scientist). Learning will mean reinterpreting
the model.

On a more abstract level, recent work conducted by ETH
Zurich looked at 43,000 transnational companies seeking to understand the relationships
between those companies and their potential for influence. This analysis found that
1,318 companies were tightly connected, with an average of 20 connections,
representing about 60 percent of global revenues. Deeper analysis revealed a “super-entity”
of 147 firms that accounts for about 40 percent of the wealth in the network.
This type of analysis has been conducted before, but the Zurich team included
indirect ownership, which changed the outcome significantly (for more see “The network of
global corporate control” by Vitali, Glattfelder, and Battiston).
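
To see why counting indirect ownership can change the picture, consider a minimal Python sketch. The company names and stakes below are invented for illustration, and the rule of multiplying stakes along each ownership chain is a simplification, not the Zurich team's actual method, which had to cope with cross-holdings among tens of thousands of firms.

```python
# Hypothetical direct stakes: owner -> {company: fraction of shares held}.
# All names and numbers are invented for illustration.
DIRECT_STAKES = {
    "HoldCo": {"BankCo": 0.5, "RetailCo": 0.1},
    "BankCo": {"RetailCo": 0.4},
}


def total_stake(stakes, owner, target, visited=()):
    """Direct stake of owner in target, plus stakes held through intermediaries.

    Simplifying assumption: ownership multiplies along each chain, and any
    company already on the current chain is skipped to avoid looping.
    """
    holdings = stakes.get(owner, {})
    total = holdings.get(target, 0.0)
    for intermediary, share in holdings.items():
        if intermediary == target or intermediary in visited:
            continue
        total += share * total_stake(
            stakes, intermediary, target, visited + (owner,)
        )
    return total


direct_only = DIRECT_STAKES["HoldCo"]["RetailCo"]
with_indirect = total_stake(DIRECT_STAKES, "HoldCo", "RetailCo")
print(f"direct only: {direct_only:.0%}, with indirect: {with_indirect:.0%}")
```

On direct holdings alone, the hypothetical HoldCo looks like a minor 10 percent shareholder in RetailCo. Adding the stake it holds through BankCo triples that figure, which is the kind of shift that can reorder a ranking of who controls what.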

If organizations rely on Big Data to connect far-ranging
databases–well beyond corporate ownership or maps of certain geologies–who,
it must be asked, will understand enough of the model to challenge its
underlying assumptions, and re-craft those assumptions when the world, and the
data that reflects it, changes?

Complexity: I was
sitting with the CIO of a large insurance company in Portland. We were talking
about generational hand-offs when he raised the issue of an Excel spreadsheet
used to evaluate commercial property underwriting. He said one of the older
members of the organization owned that spreadsheet and he was the only one who knew how it worked. The hand-off issue was not one of getting the older
employee to collaborate with the younger employee, but one of complexity. That
spreadsheet was complex and tightly woven into the employee’s worldview.
Although the transfer could theoretically take place, it is unknowable how long
it would take, whether the new employee would stay, or how the process would change
as multiple worldviews collided. Combining models full of nuance and obscurity increases complexity. Organizations that plan complex uses of Big Data and
the algorithms that analyze the data need to think about continuity and
succession planning in order to maintain the accuracy and relevance of their
models over time, and they need to be very cautious about the time it will take
to integrate, and the value of results achieved, from data and models that
border on the cryptic.

Feedback Loops: Big Data isn’t just about the size of
well-understood data sets; it is about linking disparate data sets and then
creating connective tissue, either through design or inference, between these
data sets. At the onset of the Great Recession, we experienced a feedback loop
failure as David X. Li’s famous Gaussian copula function, a seemingly well-tested
approach to analyzing financial risk, failed to anticipate the risks lurking
outside of its models. People kept trading, assuming their risk analysis was
still meaningful. No feedback loop existed to inform the bond markets that
their credit default swaps were an inverse pyramid teetering on a bed of
miscalculation.

Algorithms and a Lack
of Theory: It is not only algorithms that can go wrong when a theory proves
incorrect or the assumptions underlying the algorithm change. There are areas
where no theory has enough consensus behind it to be meaningful. The impact
of education (and the effectiveness of various approaches), how innovation
works, or what triggers a fad are examples of behaviors for which little valid
theory exists. Opinions about various approaches and models are plentiful,
but a theory, in the scientific sense, is nonexistent.
For Big Data that means a number of things. First and foremost, if you don’t
have a working theory, you probably don’t know what data you need to test any
hypotheses you may posit. It also means that data scientists can’t create
a model because no reliable underlying logic exists that can be encoded into a
model.

Confirmation Bias:
Every model is based on historical assumptions and perceptual biases.
Regardless of the sophistication of the science, we often create models that
help us see what we want to see, using data selected as a good indicator of
such a perception. Take a recent debate about how to price the future. Future
events are typically discounted using an exponential model that applies a
constant discount rate, which eventually leads to a value of effectively zero for far-flung
events. Exponential discounting takes a deterministic view. A more existential view
comes from the proponents of hyperbolic discounting, which creates a preference
for rewards that arrive sooner rather than later. With hyperbolic discounting,
discounts of future events fall more gradually, leading to what might be called
“irrational behavior.” Another version is called the declining discount rate.
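
The gap between the two camps is easy to see with numbers. The sketch below uses common textbook forms of each curve, 1/(1+r)^t for exponential and 1/(1+kt) for hyperbolic. The 5 percent parameter, the $1 million payoff, and the function names are assumptions chosen for illustration, not figures from the debate itself.

```python
def exponential_factor(years, rate):
    """Constant-rate discounting: value shrinks by the same fraction each year."""
    return 1.0 / (1.0 + rate) ** years


def hyperbolic_factor(years, k):
    """Hyperbolic discounting: the effective rate declines as the horizon grows."""
    return 1.0 / (1.0 + k * years)


PAYOFF = 1_000_000  # hypothetical future payoff, in dollars
RATE = 0.05         # assumed parameter, used for both curves

for years in (1, 10, 50, 100):
    exp_value = PAYOFF * exponential_factor(years, RATE)
    hyp_value = PAYOFF * hyperbolic_factor(years, RATE)
    print(
        f"{years:>3} years: exponential ${exp_value:>10,.0f}   "
        f"hyperbolic ${hyp_value:>10,.0f}"
    )
```

With the same parameter the two curves agree at one year. A century out, the exponential model values the payoff at well under 1 percent of its face value, while the hyperbolic model still puts it at about a sixth. Same data, same payoff, very different decision.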

This “discount” debate points out that even when a model
exists that is designed to aid in decision making about the future, that model
may involve contentious disagreements about its validity and alternative
approaches that yield very different results. These are important debates in
the world of Big Data. One group of modelers advocates for one approach, and
another group, an alternative approach, both using sophisticated data and black
boxes (as far as the uninitiated business person is concerned) to support their
cases. The fact is that in cases like this, no one knows the answer
definitively as the application may be contextual or it may be incomplete
(e.g., a new approach may solve the issue that none of the current approaches
solves completely). What can be said, and what must be remembered, is the adage
that “a futurist is never wrong today.” Who wins these debates today may be meaningless
because the implications have no near-term consequences, but companies that accept
one approach over the other may be betting their firm’s future on wishful
thinking and unwillingness to admit what they don’t know.

The World Changes: We
must remember that all data is historical. There is no data from or about the
future. Future context changes cannot be built into a model because they cannot
be anticipated. Consider this: 2012 is the 50th anniversary of the
1962 Seattle World’s Fair. In 1962, the retail world was dominated by Sears, Montgomery
Ward, Woolworth, A&P, and Kresge. Some of those companies no longer exist,
and others have merged to the point that they are unrecognizable from their
1962 incarnations. Also in 1962, thousands of miles away, a small company
opened its first location in Rogers, Ark.–the Wal-Mart Discount City.
Would models of retail supply chains built in 1962 be able to anticipate the
overwhelming disruption that this humble storefront would cause for retail? Did Walmart, in turn, understand the impact of Amazon.com when it went live in 1995?

The answer to all of the above is “no.” These innovations are
rare and hugely disruptive. They are far from doomsday scenarios except for
firms so entrenched in their models that they can’t adapt. And there is the
problem for Big Data. As with Li’s risk analysis algorithm, when the world
changed, the model did not. Feedback loops are important as a way of
maintaining relevance through incremental improvement, but what happens when the
world changes so much that current assumptions become irrelevant and the clock
must be started again? Not only must we remember that all data is historical,
but we must also remember that at some point historical data becomes irrelevant
when the context changes.

Motives: In a
recent BusinessWeek article (“Palantir, the
War on Terror’s Secret Weapon”), Peter Thiel is quoted as saying:
“We cannot afford to have another 9/11 event in the U.S. or anything bigger
than that. That day opened the doors to all sorts of crazy abuses and draconian
policies.” Thiel, characterized in the article as a libertarian, sees the
analysis of data as a positive for civil liberties. That position is debatable
as groups like NO2ID.net form to campaign
against not only the use of Big Data, but its creation. Given the complexity of
the data and associated models, along with various intended or unintended
biases, organizations have to go out of their way to discern the motives of
those developing analytics models, lest they allow programs to manipulate data
in a way that may precipitate negative social, legal, or fiduciary outcomes.

Acting on the Model: Consider crime analysis. George Mohler of
Santa Clara University in California has applied equations that predict earthquake
aftershocks to crime. By using the locations and times of recent crimes,
the system predicts “aftercrimes.” This kind of anticipatory data may result in battalions of police flooding a neighborhood following one burglary. With no
police presence, the anticipated crimes may well take place. If the burglars,
however, see an increase in surveillance and police activity, they may abandon
planned targets and seek new ones, thus invalidating the model’s predictions,
potentially in terms of time and location. The proponents of Big Data need to
ensure that the users of their models understand the intricacies of trend
analysis, what a trend really is, and the implications of acting on a model’s
recommendations.
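
Aftershock-style models are often written as self-exciting processes, in which each event temporarily raises the expected rate of further events. The sketch below shows that idea in its simplest form. The exponential decay, the parameter values, and the function name are assumptions for illustration; nothing here is taken from Mohler's actual system.

```python
import math


def expected_rate(day, past_event_days, baseline, boost, decay):
    """Expected events per day: a constant background rate plus a fading
    contribution from every earlier event (a simple self-exciting model)."""
    triggered = sum(
        boost * math.exp(-decay * (day - event_day))
        for event_day in past_event_days
        if event_day < day
    )
    return baseline + triggered


burglaries = [0.0]  # hypothetical: a single burglary on day 0

for day in (1, 3, 7, 30):
    rate = expected_rate(day, burglaries, baseline=0.1, boost=0.5, decay=0.4)
    print(f"day {day:>2}: expected rate {rate:.3f} per day")
```

The elevated rate in the days after the burglary is what would draw the extra patrols. Notice that the model has no term for how burglars respond to those patrols, which is precisely the gap described above.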

Where Big Data Will
Work

Some of the emerging Big Data stories don’t test the
existential limits of technology, nor do they threaten global catastrophe. The
worst outcome in dinosaur fossil hunting is not finding dinosaur bones where
you expect them, and the worst outcome of a crime prediction is a burglary that
doesn’t take place.

Big Data will no doubt be used to target advertising, reduce
fraud, fight crime, find tax evaders, collect child support payments, create
better health outcomes, and myriad other activities from the mundane to the
ridiculous. And along the way, the software companies and those who invested in
Big Data will share their stories.

Companies like monumental constructor Arup use Big Data as
a way to better model the use of the buildings they build. The Arup software
arm, Oasys, recently acquired MassMotion to help them understand the flow of
people through buildings. MassMotion can model the intimacy of a coffee shop or
the flow of hundreds of thousands through a terminal or metro system using its
agent technology. Its models are informed by data about traffic patterns,
arrival and departure times for various forms of transportation and the
environment, such as shopping locations and information desks where people
might stop or congregate. The result is
a model, sometimes with thousands of avatars, pushing and shoving, congregating
and separating–all based on MassMotion’s Erin Morrow and how he perceives the
world.

Another movement-oriented application of Big Data, Jyotish
(Sanskrit for astrology), comes from Boeing’s research center at the University of
Illinois in Urbana-Champaign. This application predicts the movement of work
crews within Boeing’s factories. It will ultimately help them figure out how to
save costs and increase satisfaction by ensuring that services, like Wi-Fi, are
available where and when they are needed.

Palantir, the Palo Alto-based startup focused on solving
the intelligence problem of 9/11, discovers correlations in the data that inform
military and intelligence agencies about the who, what, and when as a potential threat turns
into an imminent threat. With access to data models and data across government
silos, Palantir may well make its hero case more often than not in individual
cases. They have to be cautious about applying their ideas to different domains
where underlying rules might not be so clear or data so well-defined.

For some fields, like biology, placing large data sets into
open source areas may bring a kind of convergence as collaboration ensues. But
as Michael Nielsen points out in Reinventing
Discovery, scientists have very little motivation to collaborate given the
nature of publication, reputation, and tenure.

I seriously doubt that we have the intellectual infrastructure to
support the collaborative capabilities of the Internet. We may well be able to
connect all sorts of data and run all kinds of analyses, but in the end, we may
not be equipped to apply the technology in a meaningful and safe way at scales
that outstrip our ability to represent, understand, and validate the models and
their data.

A Telling Story

In the Scientific
American article, Helbing relates an old story. A drunk man is looking for
his keys under a street lamp. When asked why, he responds, “that’s where the
light is.” Helbing, it appears, wants to use Big Data to create a
brighter light so that scientists can peer beneath it for insight. Scenario
planners tell that story too, but we do so to make the point that while one is
looking for keys in the light, they aren’t paying any attention to the
darkness. The future of Big Data lies not in the stories of anecdotal triumph
that report sophisticated but limited accomplishments. No, the future of Big
Data lies rather in the darkness of context change, complexity, and overconfidence.

I will end, as Thaler did in his New York Times article, by quoting Mark Twain: “It ain’t what you
don’t know that gets you into trouble. It’s what you know for sure that just
ain’t so.”

