This year will be the year of Big Data. The Data Warehousing Institute (TDWI) reported that 90 percent of the IT professionals it surveyed said they were familiar with big data analytics, and 34 percent said they already applied analytics to Big Data. The vast hoards of data collected during e-commerce transactions and from loyalty programs, employment records, supply chain and ERP systems are, or are about to get, cozy. Uncomfortably cozy.
Let me start by saying there is nothing inherently wrong with Big Data. Big Data is a thing, and like anything, it can be used for good or for evil. It can be used appropriately given known limitations, or stretched wantonly until its principles fray. For now, the identification, consolidation, and governance of data is an appropriate step, as Forbes's Tom Groenfeldt recently documented with Michigan’s $19 million in data center consolidation savings. Dirk Helbing of the Swiss Federal Institute of Technology in Zurich is more ambitious. His €1-billion project, the topic of the December 2011 Scientific American cover story, seeks to do nothing less than foretell the future.
The meaningful use of Big Data lies somewhere between these two extremes. Moving Big Data beyond a mere instantiation of databases running in logical or physical proximity, toward data that can be meaningfully mined for insight, requires new skills, new perspectives, and new cautions.
The Big Data Dream
Dirk Helbing seeks a system that is akin to Asimov’s Psychohistory as imagined in the Foundation series. In broad swaths, it would anticipate the future by linking social, scientific, and economic data. This system could be used to help advise world governments on the most salient choices to make.
Reading the article in Scientific American reminded me of a science fiction story by Tribble-inventor David Gerrold, When Harlie Was One. In this book, Harlie, which stands for Human Analog Robot Life Input Equivalents, decides that he needs answers, and that he isn’t sophisticated enough to solve his own problem and therefore keep the corporate interests that built him interested enough to keep him plugged in. So he designs a new computer, the Graphic Omniscient Device, or GOD, as a proof of his value. GOD will answer all questions submitted to it. Unfortunately, as the human engineers building GOD eventually realize, the processing capacity is so vast that GOD will not be able to provide an answer to any question during the lifetime of a human. Harlie, of course, knew this all along. He needed the humans for three reasons: to keep him running, for engineering labor to build GOD, and to ask the questions that GOD will answer.
Given the woes of Europe, spending €1-billion on such a project will likely prove to be wasted money. We, of course, don’t have a mechanical futurist to evaluate that position, but we do have history. Whenever there is an existential problem facing the world, charlatans appear to dazzle the masses with feats of magic and wonder. I don’t see this proposal being anything more than the latest version of apocalyptic sorcery. It’s not that a big science project can’t yield interesting outcomes. But look at the Microelectronics and Computer Technology Corporation (MCC), late of Austin, Texas, and you find Cyc, a system conceived at the beginning of the computer era to combat Japan’s Fifth Generation Project, which supposedly threatened to out-innovate America’s nascent lead in computer technology. Although Cyc has yielded some use, it has not yet become the artificial human mind it was intended to be, able to converse naturally with anyone about the events, concepts, and objects in the world. And artificial intelligence, as imagined in the 1980s, has yet to transform the human condition.
As Big Data becomes the next great savior of business and humanity, we need to remain skeptical of its promises as well as its applications and aspirations.
Existential Issues With Big Data
Determinism teaches that what will be, will be. Existentialism deals with a humanity in the throes of chaos. Big Data can be seen as either a lens through which determinism is revealed, or a tool for navigating an existential world. As a scenario planner, I take the existential position and see a number of existential threats to the success of Big Data and its applications.
Overconfidence: Many managers creating a project plan, drawing up a budget, or managing a hedge fund trust their forecasts based on personal abilities and confidence in their knowledge and experience. As University of Chicago professor Richard H. Thaler recently pointed out in the New York Times ("The Overconfidence Problem in Forecasting"), most managers are overconfident and miscalibrated. In other words, they don’t recognize their own inability to forecast the future, nor do they recognize the inherent volatility of markets. Both of these traits portend big problems for Big Data as humans code their assumptions about the world into algorithms: people don’t understand their personal limitations, nor do they recognize whether a model is good or not.
When learning happens: Even in a field as seemingly physical and visceral as fossil hunting, Big Data is playing a role. Geologic data has been fed into a model that helps pinpoint good fossil-hunting fields. On the surface that appears a useful discovery, but if you dig a bit deeper, you find a lesson for would-be Big Data modelers. As technology and data sophistication increase, the underlying assumptions in the model must change. Current data, derived from the analysis of Landsat photos, can direct field workers toward a fairly large but promising area with multiple types of rock exposures. Eventually the team hopes to improve from 15-meter resolution to 15-centimeter resolution by acquiring higher-resolution data. As they examine the new data, they will need to change their analysis approach to recognize features not previously available (for more see "Artificial intelligence joins the fossil hunt" in New Scientist). Learning will mean reinterpreting the model.
On a more abstract level, recent work conducted by ETH Zurich looked at 43,000 transnational companies, seeking to understand the relationships between those companies and their potential for influence. This analysis found that 1,318 companies were tightly connected, with an average of 20 connections, representing about 60 percent of global revenues. Deeper analysis revealed a “super-entity” of 147 firms that accounts for about 40 percent of the wealth in the network. This type of analysis has been conducted before, but the Zurich team included indirect ownership, which changed the outcome significantly (for more see "The network of global corporate control" by Vitali, Glattfelder, and Battiston).
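To see why including indirect ownership can change the picture, consider a toy calculation. The sketch below is only an illustration, not the Zurich team's method: the three companies, their stakes, and the function name integrated_ownership are all hypothetical, and the simple chain-multiplication rule ignores the corrections a real study would need for circular ownership.

```python
def integrated_ownership(direct, max_depth=10):
    """Add up stakes held directly and through chains of up to max_depth links.

    direct[a][b] is the fraction of company b that company a holds directly.
    """
    names = list(direct)
    total = {a: {b: 0.0 for b in names} for a in names}
    # `paths` holds the stake reached through chains of exactly one length;
    # it starts with the direct (one-link) holdings.
    paths = {a: {b: direct[a].get(b, 0.0) for b in names} for a in names}
    for _ in range(max_depth):
        for a in names:
            for b in names:
                total[a][b] += paths[a][b]
        # Extend every chain by one more link.
        paths = {
            a: {
                b: sum(paths[a][m] * direct[m].get(b, 0.0) for m in names)
                for b in names
            }
            for a in names
        }
    return total


# Hypothetical stakes: A holds 50% of B and 10% of C; B holds 60% of C.
direct = {
    "A": {"B": 0.5, "C": 0.1},
    "B": {"C": 0.6},
    "C": {},
}

result = integrated_ownership(direct)
print(f"A's direct stake in C:     {direct['A']['C']:.2f}")
print(f"A's integrated stake in C: {result['A']['C']:.2f}")
```

Counting only direct stakes, A holds 10 percent of C. Following the chain through B adds 0.5 x 0.6, raising A's stake to 40 percent, a very different view of who controls what.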
If organizations rely on Big Data to connect far-ranging databases--well beyond corporate ownership or maps of certain geologies--who, it must be asked, will understand enough of the model to challenge its underlying assumptions, and re-craft those assumptions when the world, and the data that reflects it, changes?
Complexity: I was sitting with the CIO of a large insurance company in Portland. We were talking about generational hand-offs when he raised the issue of an Excel spreadsheet used to evaluate commercial property underwriting. He said one of the older members of the organization owned that spreadsheet and he was the only one who knew how it worked. The hand-off issue was not one of getting the older employee to collaborate with the younger employee, but one of complexity. That spreadsheet was complex and tightly woven into the employee's worldview. Although the transfer could theoretically take place, no one could know how long it would take, whether the new employee would stay, or how the process would change as multiple worldviews collided. Combining models full of nuance and obscurity increases complexity. Organizations that plan complex uses of Big Data and the algorithms that analyze the data need to think about continuity and succession planning in order to maintain the accuracy and relevance of their models over time, and they need to be very cautious about the time it will take to integrate, and the value of results achieved from, data and models that border on the cryptic.
Feedback Loops: Big Data isn’t just about the size of well-understood data sets; it is about linking disparate data sets and then creating connective tissue, either through design or inference, between these data sets. At the onset of the Great Recession, we experienced a feedback loop failure as David X. Li’s famous Gaussian copula function, a seemingly well-tested approach to analyzing financial risk, failed to anticipate the risks lurking outside of its models. People kept trading, assuming their risk analysis was still meaningful. No feedback loop existed to inform the bond markets that their credit default swaps were an inverse pyramid teetering on a bed of miscalculation.
Algorithms and a Lack of Theory: It is not only algorithms that can go wrong when a theory proves incorrect or the assumptions underlying the algorithm change. There are places where no theory commands enough consensus to be meaningful. The impact of education (and the effectiveness of various approaches), how innovation works, or what triggers a fad are examples of behaviors for which little valid theory exists--it’s not that opinion about various approaches or models is lacking, but that a theory, in the scientific sense, is nonexistent. For Big Data that means a number of things. First and foremost, if you don’t have a working theory, you probably don’t know what data you need to test any hypotheses you may posit. It also means that data scientists can’t create a model because no reliable underlying logic exists that can be encoded into a model.
Confirmation Bias: Every model is based on historical assumptions and perceptual biases. Regardless of the sophistication of the science, we often create models that help us see what we want to see, using data selected as a good indicator of such a perception. Take a recent debate about how to price futures. Future events are typically discounted using an exponential model that creates a regular discount rate that eventually leads to a value of zero for far-flung events. Exponential discounting takes a deterministic view. A more existential view comes from the proponents of hyperbolic discounting, which creates a preference for rewards that arrive sooner rather than later. With hyperbolic discounting, discounts of future events fall more gradually, leading to what might be called “irrational behavior.” Another version is called the declining discount rate.
This “discount” debate points out that even when a model exists that is designed to aid in decision making about the future, that model may involve contentious disagreements about its validity and alternative approaches that yield very different results. These are important debates in the world of Big Data. One group of modelers advocates for one approach, and another group for an alternative approach, both using sophisticated data and black boxes (as far as the uninitiated business person is concerned) to support their cases. The fact is that in cases like this, no one knows the answer definitively, as the application may be contextual or it may be incomplete (e.g., a new approach may solve the issue that none of the current approaches solves completely). What can be said, and what must be remembered, is the adage that “a futurist is never wrong today.” Who wins these debates today may be meaningless because the implications have no near-term consequences, but companies that accept one approach over the other may be betting their firm’s future on wishful thinking and unwillingness to admit what they don’t know.
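The gap between the two discounting camps is easy to see with numbers. The sketch below assumes the common textbook forms of each curve, exp(-rate x years) for exponential discounting and 1 / (1 + k x years) for hyperbolic discounting. The function names and the 0.05 parameter are illustrative assumptions, not values taken from the debate itself.

```python
import math


def exponential_discount(years, rate):
    """Weight on a payoff `years` away under a constant discount rate."""
    return math.exp(-rate * years)


def hyperbolic_discount(years, k):
    """Weight on the same payoff under a hyperbolic curve, 1 / (1 + k * years)."""
    return 1.0 / (1.0 + k * years)


# Assumed parameter, identical for both curves to keep the comparison simple.
RATE = 0.05

for years in (1, 10, 50, 100):
    exp_w = exponential_discount(years, RATE)
    hyp_w = hyperbolic_discount(years, RATE)
    print(f"{years:>3} years  exponential={exp_w:.4f}  hyperbolic={hyp_w:.4f}")
```

With these assumed values the two curves nearly agree one year out, but at 100 years the exponential weight is under 1 percent while the hyperbolic weight is about 17 percent. The same far-future event looks negligible under one model and material under the other.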
The World Changes: We must remember that all data is historical. There is no data from or about the future. Future context changes cannot be built into a model because they cannot be anticipated. Consider this: 2012 is the 50th anniversary of the 1962 Seattle World’s Fair. In 1962, the retail world was dominated by Sears, Montgomery Ward, Woolworth, A&P, and Kresge. Some of those companies no longer exist, and others have merged to the point that they are unrecognizable from their 1962 incarnations. Also in 1962, thousands of miles away, a small company opened its first location in Rogers, Ark.--the Wal-Mart Discount City. Would models of retail supply chains built in 1962 be able to anticipate the overwhelming disruption that this humble storefront would cause for retail? Did Wal-Mart, in turn, understand the impact of Amazon.com when it went live in 1995?
The answer to all of the above is "no." These innovations are rare and hugely disruptive. They are far from doomsday scenarios except for firms so entrenched in their models that they can’t adapt. And there is the problem for Big Data. As with Li’s risk analysis algorithm, when the world changed, the model did not. Feedback loops are important as a way of maintaining relevance through incremental improvement, but what happens when the world changes so much that current assumptions become irrelevant and the clock must be started again? Not only must we remember that all data is historical, but we must also remember that at some point historical data becomes irrelevant when the context changes.
Motives: In a recent BusinessWeek article ("Palantir, the War on Terror's Secret Weapon"), Peter Thiel is quoted as saying: “We cannot afford to have another 9/11 event in the U.S. or anything bigger than that. That day opened the doors to all sorts of crazy abuses and draconian policies.” Thiel, characterized in the article as a libertarian, sees the analysis of data as a positive for civil liberties. That position is debatable as groups like NO2ID.net form to campaign against not only the use of Big Data, but its creation. Given the complexity of the data and associated models, along with various intended or unintended biases, organizations have to go out of their way to discern the motives of those developing analytics models, lest they allow programs to manipulate data in a way that may precipitate negative social, legal, or fiduciary outcomes.
Acting on the Model: Consider crime analysis. George Mohler of Santa Clara University in California has applied equations that predict earthquake aftershocks to crime. By using location data and times of recent crimes, the system predicts “aftercrimes.” This kind of anticipatory data may result in battalions of police flooding a neighborhood following one burglary. With no police presence, the anticipated crimes may well take place. If the burglars, however, see an increase in surveillance and police activity, they may abandon planned targets and seek new ones, thus invalidating the model's predictions, potentially in terms of both time and location. The proponents of Big Data need to ensure that the users of their models understand the intricacies of trend analysis, what a trend really is, and the implications of acting on a model’s recommendations.
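The article does not say which aftershock equations are involved. One widely used family is the self-exciting point process, in which each event temporarily raises the expected rate of further events. The sketch below is a generic illustration of that idea in time only; the function name expected_rate, the parameter values, and the single burglary are all hypothetical, and the code models neither location nor the police response that could invalidate the forecast.

```python
import math


def expected_rate(t, past_events, background=0.2, boost=0.5, decay=1.0):
    """Expected events per day at time t, given the times of earlier events.

    Each past event adds a bump that fades exponentially, so the rate is
    highest just after an event and drifts back toward `background`.
    """
    triggered = sum(
        boost * decay * math.exp(-decay * (t - s))
        for s in past_events
        if s < t
    )
    return background + triggered


burglaries = [1.0]  # hypothetical: one burglary reported on day 1
for day in (0.5, 1.1, 2.0, 5.0):
    print(f"day {day}: expected rate {expected_rate(day, burglaries):.3f}")
```

The forecast spikes right after the burglary and fades over the following days. Nothing in the model knows whether police flooded the neighborhood in response, which is exactly the gap between a prediction and the consequences of acting on it.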
Where Big Data Will Work
Some of the emerging Big Data stories don’t test the existential limits of technology, nor do they threaten global catastrophe. The worst outcome in dinosaur fossil hunting is not finding dinosaur bones where you expect them, and the worst outcome of a crime prediction is a burglary that doesn’t take place.
Big Data will no doubt be used to target advertising, reduce fraud, fight crime, find tax evaders, collect child support payments, create better health outcomes, and myriad other activities from the mundane to the ridiculous. And along the way, the software companies and those who invested in Big Data will share their stories.
Companies like monumental constructor Arup use Big Data as a way to better model the use of the buildings they build. The Arup software arm, Oasys, recently acquired MassMotion to help them understand the flow of people through buildings. MassMotion can model the intimacy of a coffee shop or the flow of hundreds of thousands through a terminal or metro system using its agent technology. Its models are informed by data about traffic patterns, arrival and departure times for various forms of transportation, and the environment, such as shopping locations and information desks where people might stop or congregate. The result is a model, sometimes with thousands of avatars, pushing and shoving, congregating and separating--all based on MassMotion’s Erin Morrow and how he perceives the world.
Another movement-oriented application of Big Data, Jyotish (Sanskrit for astrology), comes from Boeing’s research center at the University of Illinois in Urbana-Champaign. This application predicts the movement of work crews within Boeing’s factories. It will ultimately help them figure out how to save costs and increase satisfaction by ensuring that services, like Wi-Fi, are available where and when they are needed.
Palantir, the Palo Alto-based startup focused on solving the intelligence problem of 9/11, discovers correlations in the data that tell military and intelligence agencies who, what, and when a potential threat turns into an imminent threat. With access to data models and data across government silos, Palantir may well make its hero case more often than not in individual cases. It has to be cautious, though, about applying its ideas to different domains where underlying rules might not be so clear or data so well-defined.
For some fields, like biology, placing large data sets into open source areas may bring a kind of convergence as collaboration ensues. But as Michael Nielsen points out in Reinventing Discovery, scientists have very little motivation to collaborate given the nature of publication, reputation, and tenure.
I seriously doubt that we have the intellectual infrastructure to support the collaborative capabilities of the Internet. We may well be able to connect all sorts of data and run all kinds of analyses, but in the end, we may not be equipped to apply the technology in a meaningful and safe way at scales that outstrip our ability to represent, understand, and validate the models and their data.
A Telling Story
In the Scientific American article, Helbing relates an old story. A drunk man is looking for his keys under a street lamp. When asked why, he responds, “that's where the light is.” Helbing, it appears, wants to use Big Data to create a brighter light so that scientists can peer beneath it for insight. Scenario planners tell that story too, but we do so to make the point that while people are looking for keys in the light, they aren’t paying any attention to the darkness. The future of Big Data lies not in the stories of anecdotal triumph that report sophisticated but limited accomplishments--no, the future of Big Data rather lies in the darkness of context change, complexity, and overconfidence.
I will end, as Thaler did in his New York Times article, by quoting Mark Twain: “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”
[Image: Flickr user Purplemattfish]