This is a guest post by David Hales, a fellow associate of the new complexity think-tank, Synthesis. David specialises in computational social science and here he provides a thought-provoking response to the rise in big data, and some of the more outlandish claims made about it. For a good example of the latter, see Chris Anderson’s piece ‘The Data Deluge Makes the Scientific Method Obsolete’. In this piece, David makes some very relevant points for development big data initiatives.


Almost everything we do these days leaves some kind of data trace in some computer system somewhere. When such data is aggregated into huge databases it is called “Big Data”. It is claimed social science will be transformed by the application of computer processing and Big Data. The argument is that social science has, historically, been “theory rich” and “data poor” and now we will be able to apply the methods of “real science” to “social science” producing new validated and predictive theories which we can use to improve the world.

What’s wrong with this? On one level nothing. We know so little about the social world that anything is worth a try. Mining these huge databases will almost certainly lead to new ideas and insights. However, before we run headlong into this new world of big data, promoted as it is by corporations such as IBM and the large consultancies, perhaps we might benefit from a little critical reflection.

Firstly, what is this “data” we are talking about? In its broadest sense it is some representation, usually in a symbolic form, that is machine readable and processable. And how will this data be processed? Using some form of machine learning or statistical analysis. But what will we find? Regularities or patterns (for a useful discussion of patterns within complex systems, see Greg Fisher’s post, Patterns Amid Complexity). What do such patterns mean? Well, that will depend on who is interpreting them.

Given this level of generality, if someone tells you they are working on “big data” it tells you almost nothing. One way to approach the issue if confronted with a “big data” project is to ask the following question based on a thought experiment:

“Imagine you had a massive computer database that contained all possible measurements that could ever be made over the entire span of all space and time. You could query it with any question and it would deliver the result instantaneously. All big data is merely a subset of this ‘biggest data that could ever exist’. What would your project ask it?”

If no coherent answer can be produced to this question then any such project is at best directionless and at worst not conscious of its aims.

One answer might be “looking for patterns or regularities in the data”. Looking for “patterns or regularities” presupposes a definition of what a pattern is and that presupposes a hypothesis or model, i.e. a theory. Hence big data does not “get us away from theory” but rather requires theory before any project can commence.

What is the problem here? The problem is that a certain kind of approach is being propagated within the “big data” movement that claims to not be a priori committed to any theory or view of the world. The idea is that data is real and theory is not real. That theory should be induced from the data in a “scientific” way.

I think this is wrong and dangerous. Why? Because it is not clear or honest while appearing to be so. Any statistical test or machine learning algorithm expresses a view of what a pattern or regularity is and any data has been collected for a reason based on what is considered appropriate to measure. One algorithm will find one kind of pattern and another will find something else. One data set will evidence some patterns and not others. Selecting an appropriate test depends on what you are looking for. So the question posed by the thought experiment remains “what are you looking for, what is your question, what is your hypothesis?”
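To make the point concrete, here is a minimal Python sketch (an illustration added to the argument, with invented data and hypothetical variable names). The same data set is handed to two "pattern detectors". The first asks whether there is a straight-line relationship; the second asks whether there is a quadratic one. Each answer reflects the assumption built into the question, not just the data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: a detector that only 'sees' straight-line patterns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: y is completely determined by x, but through a curve.
xs = [x / 10 for x in range(-50, 51)]
ys = [x * x for x in xs]

# Question 1: "is there a linear trend?" The detector reports essentially none.
linear_view = pearson(xs, ys)

# Question 2: "is there a quadratic trend?" Same data, near-perfect pattern.
quadratic_view = pearson([x * x for x in xs], ys)

print(f"linear view:    {linear_view:.3f}")
print(f"quadratic view: {quadratic_view:.3f}")
```

Neither number is wrong. They differ because each test encodes a prior idea of what counts as a pattern, which is exactly the hypothesis the analyst has to own up to.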

It seems to me that one must at least try to answer this question if one is to pursue social science. Not just because it is good science but also because it has ethical and political implications. The view one takes of social phenomena, either consciously or through algorithms and data, frames what is and is not conceivable for past and future social reality. If you doubt the importance of such ideas you should look at the history of the 20th century. Ideas matter. Theory matters. Big data is not a theory-neutral way of circumventing the hard questions. In fact it brings these questions into sharp focus and it’s time we discuss them openly.

Right now we are “data rich” and “theory poor”. We need new theory for the 21st century. That requires critical discussion, reflection, honesty and humility. It is not clear to me that such concerns are prominent within much of the “big data” movement.

Here is a more eloquent and playful take on these issues, by a colleague of mine, in the genre of that wonderful Orwell fable: https://scensci.wordpress.com/2012/12/14/big-data-or-pig-data/

Cross-posted from the Synthesis blog.

Join the conversation! 19 Comments

  1. Thank you, David, for this call to reason. The general failure to recognize the intimate relationship between perception and conception speaks to the poor job we do in science education. Bowing in awe before the great Oz, no one knows to look behind the curtain.

    As with many technological panaceas, the hyped claims are made by folks who have never actually worked with the nuts-and-bolts of data mining. It’s messy and fraught with peril (esp. when you’re sure the data is telling you something). In any case, having a call to reason (however unheard it will be) is definitely a good thing. Keep it up.

  2. The claim that big data is theory free is not widely made, for good reasons. Most people recognise that there are still assumptions being made (a) when a data set is being constructed – what attributes/variables to include versus exclude, (b) when an analysis is being designed – e.g. what outcome is of interest and what algorithms to use, (c) when data is being cleaned up prior to analysis, (d) and when results are being viewed – what stands out as a likely artifact versus possible pattern of real interest.

    But please let us not throw the baby out with the bathwater. Development aid projects can generate large amounts of data via their monitoring systems. Yet these data sets get relatively little attention compared to the focus on evaluation efforts and their own special data gathering requirements. In my view there is considerable potential for using simple data mining tools to make fuller use of monitoring data, to get something closer to a real time analysis and to give more recognition to the diversity of outcomes (and associated causal configurations) that can be found in many large development projects. For more on this subject see my recent paper on the use of Decision Tree analyses (both computer generated and participatory designed) http://mande.co.uk/2012/uncategorized/where-there-is-no-single-theory-of-change-the-uses-of-decision-tree-models/

    It is also worth remembering that the hypothesis testing approach to science (and evaluation) is only one side of the coin. The other side are inductive methods, of which data mining is one set – especially suited to large data sets. Both inductive (data derived) and deductive (theory derived) methods are needed, and are often complementary. I think Darwin’s theory of evolution can be described as an inductively derived theory, and interestingly, one which was not actually tested as a hypothesis until many decades later (http://www.sciencedaily.com/releases/2010/05/100512131513.htm)

    Theory led evaluation is in the ascendancy in the field of development evaluation at present. This should not blind us to the need for other approaches, especially where there is absence of a relevant theory or no single relevant theory.

    • Hello Rick,

      You have made relevant and important points that merit close attention. Therefore, I hope you won’t mind if I respond to each point individually. At the end I will try to summarize what I find problematic with the big data approach.

      >The claim that big data is theory free is not widely made, for good reasons. Most people recognise that there are still assumptions being made (a) when a data set is being constructed – what attributes/variables to include versus exclude, (b) when an analysis is being designed – e.g. what outcome is of interest and what algorithms to use, (c) when data is being cleaned up prior to analysis, (d) and when results are being viewed – what stands out as a likely artifact versus possible pattern of real interest.

      From the little that I know, initially large datasets were randomly sampled to come up with a distribution to optimise queries, but now something called “core-sets” are being used to tap *all* representative points in the dataset. Core-sets are not ad-hoc but follow some basic mathematical rules. If that is what you mean by a model, then fine. But this was not brought into question, I don’t think, in the blog. Usually, in the sciences, data is selected based on some model or theory. But core-sets — and other such techniques — are surely not that. (These could be termed as “domain agnostic” ways of selecting data). Such techniques can perhaps take a prior position, but they are not “models” or “theories” in the sense in which David used the terms, and which I just tried to explain.

      However, if you don’t mean the above, and you actually do mean that data is being selected based on existing theories or models of the system/domain under investigation, then I fail to see what the problem is, and more importantly, I fail to see how your claims below about “data induced” theories, can be reconciled to this position.

      >But please let us not throw the baby out with the bathwater. Development aid projects can generate large amounts of data via their monitoring systems. Yet these data sets get relatively little attention compared to the focus on evaluation efforts and their own special data gathering requirements. In my view there is considerable potential for using simple data mining tools to make fuller use of monitoring data, to get something closer to a real time analysis and to give more recognition to the diversity of outcomes (and associated causal configurations) that can be found in many large development projects. For more on this subject see my recent paper on the use of Decision Tree analyses (both computer generated and participatory designed) http://mande.co.uk/2012/uncategorized/where-there-is-no-single-theory-of-change-the-uses-of-decision-tree-models/

      Yes, indeed you are right. In fact, people have formally verified operating systems as well as stream processing systems, among other things. This is all potentially useful! But surely, this is engineering and not science, social or otherwise. When people start claiming that we can make sense of the world without models, that we can do science without any theories, all based on large amounts of data, that’s where the problem arises, and this is what I think the blog post aims to address.

      >It is also worth remembering that the hypothesis testing approach to science (and evaluation) is only one side of the coin. The other side are inductive methods, of which data mining is one set – especially suited to large data sets. Both inductive (data derived) and deductive (theory derived) methods are needed, and are often complementary. I think Darwin’s theory of evolution can be described as an inductively derived theory, and interestingly, one which was not actually tested as a hypothesis until many decades later (http://www.sciencedaily.com/releases/2010/05/100512131513.htm)

      You are talking about two different things here (though perhaps related). Let me first address testability. For what it’s worth, Newton’s second law of motion was not tested till 75 years (or so) after its formulation. As things stand presently, as we all know, string theory can’t be tested. But this point need not be discussed at length here as this does not really concern the issue at hand. Let’s ignore it and move forward. What is important, and what really needs to be discussed, is your other point: of “data derived theories”. And here, you have given the example of Darwin. I agree that Darwin did start out with his great journeys and seeing all the “data” around him. But if you notice, later on, when it actually came to theory formulation, he pretty much ignored all the data. As an example, he ignored the entire fossil record and made the leap of faith that it is not complete, and ended up with the gradualist program of evolution. Whereas the data was telling him that evolution should be abrupt and in distinct stages. The gradualist program could not possibly be induced from the data. Similarly, Galileo could not induce a rotating earth from the data available to him. The rotating earth hypothesis suggested that all objects on the surface of the earth should go flying off, whereas the data clearly showed that everything stayed put. Mendel did the same and ignored a lot of the data that did not fit the theory.

      >Theory led evaluation is in the ascendancy in the field of development evaluation at present. This should not blind us to the need for other approaches, especially where there is absence of a relevant theory or no single relevant theory.

      Surely, all sciences at their earliest stages start from some experiential data, but as soon as they move into any form of maturity, they depend on data from experiments (and not experience). And those experiments are modeled by theory. And in fields where experiments can’t be done, those sciences too go into a direction where data is often ignored, distorted or selected, based on theory. In short, what is problematic is when some big-data advocates start claiming that theory can be induced from data. This seems like a rather alien concept of science, for all the reasons that I have discussed above.

      • You hit the nail on the head. It would be nice to know who is making the claim that big data is theory-free. Otherwise, this is a great knockdown of a great straw man.

        I haven’t encountered anyone who says we should just look for patterns in data and not root causes.

        On the other hand, I’ve had to endure quite a few people who come up with complex theories and models from first principles, and feel no obligation to test them against real world data, nor could they pass the test.

  3. PS: Re “new world of big data, promoted as it is by corporations such as IBM and the large consultancies”: I should also point out that there are also a number of free open source data mining packages available on the web, e.g. RapidMiner. See http://en.wikipedia.org/wiki/RapidMiner

  4. Hi Rameez

    I will try my best to respond, though I may not have understood all that you said.

    1. I think the Anderson posting (and position taken) is a bit of a straw man; David Hales seems to acknowledge this when citing Anderson as an example of “some of the more outlandish claims”. I tried to support what I think is a more common view, that data mining is not theory free, by reference to my own experience of data mining, which is on small data rather than large sets, but nevertheless used some common data mining methods, e.g. Decision Trees. I admit I may have muddied the waters by using the word theory in too wide a sense. When it comes to selecting and cleaning data sets, and interpreting findings, all sorts of assumptions, often tacit, can creep into play. These are a sort of undercover “theories” of a modest kind.

    2. I don’t think I understand your references to engineering, in reply to my request to “not throw the baby out with the bathwater”. Monitoring in (social) development projects is very much about applied social science. I think there are important opportunities to use data mining tools in the analysis of monitoring data, which was the point I was trying to make. Acceptance of this idea does not require buying into the idea that data mining can be theory free.

    3. I was linking theory with hypothesis testing, and data gathering/ mining with induction, and arguing that there is a place for both (and in fact often quite a complex inter-relationship over time). The fact that Darwin ignored some of his troublesome data does not surprise me at all. Theory led evaluation is dominant in my field at present, and that brings attendant risks that other ways of developing knowledge will be ignored. That was my concern.

    • Hi Rick,
      Thanks for your reply. Ignoring some of the issues that could be debated, I find that we are pretty much in agreement and in fact much of what you say is also addressed in the original post by David. For example, when the post says: “Right now we are “data rich” and “theory poor”. We need new theory for the 21st century. That requires critical discussion, reflection, honesty and humility. It is not clear to me that such concerns are prominent within much of the “big data” movement.”

      I guess this is also saying that we need both theory and data. However, one important point: the position taken in Anderson’s post might seem to be a straw man; in fact Anderson himself later admitted that he was being provocative. But it would amuse you to know that this seemingly ridiculous, straw-man of a position is actually followed in practice! It’s been going on in many fields. For example, consider what the Nobel Prize winning biologist Sydney Brenner has to say about such an approach in his field: “The orgy of fact extraction in which everybody is currently engaged has, like most consumer economies, accumulated a vast debt. This is a debt of theory, and some of us are soon going to have an exciting time paying it back – with interest, I hope”

      And I feel in essence, this is what the post tried to bring out.

  5. Dear David,
    Thanks for this good post. Couldn’t agree much more: analysis without theory; I wouldn’t know what it looks like. These are similar points to what I tried to address in my blog on Captain Kirk (http://europeandcis.undp.org/blog/2012/11/28/development-data-still-needs-its-captain-kirk/ ).
    One thing though about big data is that it is – potentially – fast. The data is available and does not need to be collected in time-consuming manners. This helps if one wants to have ‘real-time’ insights. Our ‘evidence’ in the development debate so often is so old that one may question its relevance in our fast changing environments. Big data can help to come to terms with development trends faster than we could before and maybe increase the relevance of our thoughts (and subsequent action).
    Albert

  6. Interesting thoughts. I fear I disagree with all of them 🙂
    Evolution, itself, is deduction.

    Less laconically: Genetic/evolutionary machine learning is deductive reasoning – if in a psychologically alien context. Yes, you must have an initial aim to your project (otherwise it’s not a project), but you do not need a model. That will evolve. An expressed genome is a theory. Billions may get discounted. Just as billions of ideas form in our minds, but fail to ‘take hold’ when they fail to hold water/prove useful. Many, however, should eventually prove usefully predictive, in regard to your project’s aim. Rarely, but on occasion, one will remain the most potent for many generations, but eventually it will most likely get superseded. Very rarely some never will, and they become enshrined by dint of simply being ‘better’ than all other competing theories/expressed genomes. Just as they do in ‘science’. The problem with this is that humans want to be able to communicate their ‘theories’, and co-comprehend them. Currently, only the test of time can afford any credence to such black boxes as today’s evolutionary algorithms. I predict a major goal of such machine learning techniques will be to automatically describe the models that are generated by ML, in such a way as to afford critical analysis by humans, or possibly other independently created systems.

    The ethical quandary left at that point is then: “Aren’t we holding back progress, by imposing human requirements on these systems?” Do we needlessly anthropomorphize machine learning systems, and thereby hinder them? Or is scientific method, or some analogous machine-based critical analysis, a necessary part of insight generation, and our trust thereof?

  7. Dear all,

    Thanks for all the thoughtful comments on the post. Basically I think I ought to clarify. I am not against data or machine learning. What I was aiming at was the concept of “big data” as something that is somehow qualitatively different. That somehow just having a lot of data will make new social theories and models appear. I’ve worked extensively with many machine learning methods and they are great for finding certain patterns in data. These might be decision trees formed by recursively splitting the input attribute space based on information gain over some assigned class of interest. They may be genetic algorithms for which some fitness function or method has been assigned. You can even evolve, say, genetic programs that code dynamic behaviors.
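    The decision-tree step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the records, the attribute names and the function names (`entropy`, `information_gain`, `best_split`) are invented here, and a real tree learner would apply `best_split` recursively to each resulting subset. Note that the analyst still has to nominate the class of interest (`outcome` below) before the algorithm can find anything.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` obtained by splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_split(rows, attributes, target):
    """Pick the attribute whose split gives the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical records; "outcome" is the class of interest chosen by the analyst.
rows = [
    {"region": "north", "funded": "yes", "outcome": "success"},
    {"region": "north", "funded": "no", "outcome": "failure"},
    {"region": "south", "funded": "yes", "outcome": "success"},
    {"region": "south", "funded": "no", "outcome": "failure"},
]

chosen = best_split(rows, ["region", "funded"], "outcome")
print(chosen)
```

    In this toy data `funded` separates the outcomes completely while `region` tells us nothing, so the first split is on `funded`. Change the class of interest and the tree changes with it.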

    But what these algorithms can do does not support the great claims of “big data”. Big data is a research project in machine learning not a new way to dispense with theory. Theory directs what to look at, what data to collect and how to view it – interpret it.

    My background is in Artificial Intelligence and, as a grad student working on machine learning and neural networks at the time, I remember the big neural network hype. People said we just need to put a massive neural network in a robot and give it all the human senses (all that data) and it would learn to do intelligent things – like interpreting its environment, walking and talking. Yes, really. There was no new theory there; it was argued that these results would magically emerge from the system. The idea was “intelligence without representation”.

    Well, it didn’t work. Don’t get me wrong, very interesting things were done. But the hype turned out to be just that.

    I’m not against people trying things – as I said in the post we know very little about how social systems work – but I don’t think we should suspend critical evaluation. Those promoting big data should try (at least try) to describe what kind of things they are looking for. I don’t feel saying “big data” and “machine learning” is really enough.

    What big data will do – as it always has – is provide useful statistical summaries for technical policy work. Actually, IBM itself grew out of the need for machines to process the US census data when it got too big for people. All that census data and machines that process it have been around for a long time. Essential for running a modern state – but it’s not producing new social theories.

    My remark about IBM and large consultancies is not about them trying to sell software. It’s about them getting large government contracts – and possibly control of data. Modern states have a lot of data and want people who can offer them solutions to their problems. Those who make big promises are likely to get a favorable hearing. One should be aware of these issues because they can distort debate.

    Overall, I’m all for data and machine learning. But let’s not forget to think. Thankfully, given the quality of the responses on this blog I feel a bit better about that.

    Thanks again!

  8. […] pretty much along the lines of my first one. Despite the excitement about the age of Big Data (and here’s a much-needed antidote), the real issue is, regardless of size, will people choose to ignore […]

  9. Dave, interesting post!! 🙂 My first takeaway after reading is ‘this guy wants to make sure that we have very clearly defined and relevant (though that goes without saying) questions before even considering that data.’ I think you make a very fine and important point that tends to get sidetracked with all the enthusiasm and hype over ‘big data.’

    Your post really adds some thinking material to our recently started initiative- we would like to see whether torrents of operational data that global development organizations create each day through implementation of projects (e.g. contracts, amounts, suppliers, etc) could be better used to improve operational effectiveness of UNDPs and WBs of this world.

    Our idea is to start exactly where you suggest- refine questions. We will organize a series of data dives (http://europeandcis.undp.org/blog/2013/01/11/can-big-data-help-deliver-better-operational-results/) where we would bring our organizational data sets to programmers and ask them what questions we could answer given the available data (can we predict likelihood of corruption in projects, what leads to under-delivery of certain projects, are there any companies that tend to win majority of contracts in any given sector and what are their linkages to other suppliers, etc), whether there are questions we should be asking but are not, and whether there is data we should be collecting but are not.

    We tried to take a stab at a few of these issues two weekends ago, looking at the World Bank data set of major awarded contracts from 2007-2013 and came up with some really interesting results (http://europeandcis.undp.org/blog/2013/01/31/big-data-and-development-organizations-what-happens-when-you-move-from-theory-to-practice/).

    So in a nutshell, I think we’re just at the tip of the iceberg, but any steps forward need to be grounded in asking the proper questions. And just as a side note on your comment on theory and data- I’m reading ‘Antifragile’ by Nassim Taleb, where he makes an argument that we first DO and then BUILD theory (as opposed to having theory then building something out of it) and I think in the context of big data that’s all the more relevant. We aren’t even aware of all the value we can generate from, in our case, operational data- we know some but not all. However, as we go along, as we start tinkering with the data, chances are we’ll know more and we’ll be able to contribute to social science.

    Again thanks so much for a super interesting post, really enjoyed reading it!!
    @ElaMi5

  10. […] (excellent) Just 4 of 40 Massachusetts compounding pharmacies passed surprise health inspections Lies, Damned Lies and Big Data “God Made a Farmer” Super Bowl Commercial Celebrates Farmers, Yet Ignores Reality 30+ of the […]

  11. […] Lies, Damned Lies and Big Data – David Hales issues a warning that the rush to use Big Data may be “data rich” but “theory poor” — with scientific as well as ethical and political implications. […]

  12. […] this thought experiment from David Hales at the new complexity think tank Synthesis: "Imagine you had a massive computer database that […]

  13. […] In this guest cross-post, Geoffrey West, former President of the Santa Fe Institute, argues that just as the industrial age produced the laws of thermodynamics, we need universal laws of complexity to solve intractable problems of the post-industrial era, and that ‘big data’ needs such ‘big theory’. For more on this topic, see David Hales’ guest post from February this year ‘Lies, Damned Lies and Big Data’. […]



About Ben Ramalingam

I am a researcher and writer specialising on international development and humanitarian issues. I am currently working on a number of consulting and advisory assignments for international agencies. I am also writing a book on complexity sciences and international aid which will be published by Oxford University Press. I hold Senior Research Associate and Visiting Fellow positions at the Institute of Development Studies, the Overseas Development Institute, and the London School of Economics.
