Big Data, Big Claims: RSS meeting on Big Data, 19 Jan. 2015

On 19 January 2015, I attended the meeting on Big Data at the Royal Statistical Society (RSS). The speakers were: Kenneth Cukier (Data Editor at The Economist); Haishan Fu (Director of Development Data Group at the World Bank); Alex ‘Sandy’ Pentland (Professor at MIT and Academic Director of Data-Pop Alliance); Nuria Oliver (Scientific Director at Telefonica); and John Pullinger (UK National Statistician), with Denise Lievesley (Dean of Faculty, King’s College) as Chair.

It is impossible to summarise all the contributions or the excellent discussion here, but the following comments (recorded approximately) give a flavour of the meeting.

Kenneth Cukier: “Big Data allows data to be produced as frequently as is needed, not at the convenience of the producers.”

Haishan Fu: “The WB’s aim is to support national data producers and to incentivise the private sector … with a particular aim to help poorer countries ‘leap-frog’ to catch up.”

Nuria Oliver: “Access to data from phone systems allows inferences, at macro levels about responses to large-scale flooding, and at individual levels about the socio-economic status of the user or their likely responses on a range of social issues.”

Sandy Pentland: “Having access to more data will allow us to solve more problems … data [ownership? / access?] should be decentralised to the citizen.”

John Pullinger: “Big Data provides a wake-up call, and an opportunity, for our statistical institutions.”

At the end of the session, the Chair, Denise Lievesley, asked for reactions from the audience, in terms of what we would take away from the meeting. My reaction was that the discussion had made me realise, even more strongly, that statisticians (along with others) over the years have refined a number of crucial ideas, based on a sense of methodological quality (or validity). These relate to several issues.

First, causality is to be distinguished from correlation. This can of course be accomplished by randomised controlled trial (RCT) designs, as well as by carefully controlled and replicated non-experimental studies.

Second, it is important to appreciate how the available indicators have been constructed and understood by the human beings (or processes) producing them. Hence John Pullinger’s timely offer of a course in questionnaire design to one of the “Big Data-enthusiasts” on the panel.

Third, the data analyst also needs knowledge of how the sample before us has been selected, and attenuated by various forms of non-response.
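To make this third point concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than anything presented at the meeting: the population is hypothetical (log-normal, income-like values), and the response mechanism (units above the median are four times as likely to be observed) is invented for the purpose. It shows only that a very large sample can still give a misleading mean when inclusion depends on the quantity being measured.

```python
import random


def simulate_nonresponse(n=100_000, seed=0):
    """Compare the mean of a hypothetical population with the mean of
    the units that happen to be observed under selective response."""
    rng = random.Random(seed)

    # Hypothetical population of income-like values (assumed log-normal).
    population = [rng.lognormvariate(10, 0.5) for _ in range(n)]
    median = sorted(population)[n // 2]

    # Assumed response mechanism: units above the median are observed
    # with probability 0.8, the rest with probability 0.2.
    observed = [
        x for x in population
        if rng.random() < (0.8 if x > median else 0.2)
    ]

    population_mean = sum(population) / len(population)
    observed_mean = sum(observed) / len(observed)
    return population_mean, observed_mean, len(observed)


if __name__ == "__main__":
    pop_mean, obs_mean, n_obs = simulate_nonresponse()
    print(f"population mean: {pop_mean:,.0f}")
    print(f"observed mean:   {obs_mean:,.0f} (from {n_obs:,} units)")
```

Under these assumptions the observed mean overstates the population mean, and collecting more data through the same mechanism does not remove the gap; only knowledge of how units came to be selected allows it to be corrected.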

These three methodological concerns are not always fully acknowledged by enthusiasts for Big Data. An era in which “Big Data” is being extensively hyped will produce additional challenges for statisticians regarding data management and data analysis. These are already being addressed by inter-disciplinary teams in many universities and organisations. But there is no reason for statisticians to be reticent about the contributions they can make to these clearly important debates.

Jeff Evans