Geodemographics Blog

David Martin: 
Which data would you like with that - the Traditional, the Administrative or the Big?

I recently attended Eurostat's 2015 New Techniques and Technologies for Statistics (NTTS) conference in Brussels, which provided a great opportunity for some reflection. In the run-up to the 2011 census I was involved in efforts to build more time-sensitive representations of population, ranging from the new geography of census workplace zones and workplace data, through to development of a modelling framework for estimating 24/7 population distributions using a much more experimental integration of Open Data derived from multiple sources. At the same time I was very involved in the "Beyond 2011" debate about the future of the census, especially the continued need for small area outputs, and in the establishment of ESRC's new Administrative Data Research Centre for England (ADRC-E), intended to provide highly secure researcher access to linked administrative data extracts. The census workplace zones, useful though I hope they will prove, represent fine-tuning of a long-established census system, one that seems unlikely ever to be repeated exactly as in 2011. The themes of the NTTS conference bring many of these strands together, as presenters and delegates, mostly from national statistical organisations like our own ONS, visibly grappled with the enormous implications of moving beyond traditional official statistics to a world driven by the possibilities of new administrative and big data.

At this point, I should offer a personal distinction between administrative data, a category covering sources such as electoral or health registration where some kind of administrative process has been invoked, and truly big data such as mobile phone records, household energy usage from smart meters and many future data streams from smart devices - all generated by automated systems which log activity. Both strands were much in evidence. The session on "the worldwide modernisation of population censuses" was full of developments which lean heavily on administrative data sources. Aggregate statistics akin to census outputs could be produced in most countries, but at the record level they would be considered highly confidential and subject to restricted access (an issue being revisited as part of the current ONS consultation on the Approved Researcher Scheme, with the potential to help iron out some perverse restrictions on business users). There is a rapidly developing new world of access controls and secure access facilities for anonymised record-level data, of which ADRC-E is a part, but at present there are very few finished studies from these efforts: everyone is focused on getting infrastructure in place to support the analysis that we believe users want to undertake, on datasets that can only be released once owners have comfort about that infrastructure! A tricky business of getting the carts and horses in the right order.

Alongside all this talk about administrative data, the conference included some enormously impressive presentations using really big data, for example from the Mobility Lab at the University of Tartu in Estonia, working with Eurostat, on detailed representations of tourist behaviour using passive mobile phone positioning data. The session "privacy and access to big data" was full to capacity, reflecting the collective struggle of our statistical organisations to work out how these still-developing infrastructures for access and management of administrative data can be extended, at such an early stage, to handle the far greater challenges of big data. Alongside the substantial technical challenges, the legal frameworks for access are a highly contentious and complex issue - especially in the European setting. I am not overly pessimistic about all of this, but none of us should be under any illusions: getting this wrong in a major way could set back data access for many years.

Key issues that were not as prominent as I had expected were calibration and validation - when we try to move from traditional sources to new sources, can we achieve consistent measures of the same quantities? History has many unfortunate examples of re-basing measures such as inflation rates or population definitions, which created unbridgeable discontinuities in data series and endless argument over whose measurement is actually correct. While data democratisation is a great thing, the absence of generally agreed standard measures such as inflation rates, population sizes or (dare I say) migration rates most certainly is not. This is the traditional realm of official statistics. Many readers of this blog will remember the extensive programme of consultation and debate which led to the eventual selection of questions for the 2011 census. While this does not always yield the results that we hope for (e.g. a census income question), it certainly does serve to focus attention on the demand for information and the relative merits of one measure compared to another. It also leads to extensive testing to be as sure as possible that the measurement instrument works properly. By contrast, we know very little about the measurement that actually takes place in the administrative and especially the big data worlds. I can't help feeling that there is a huge amount still to be done in this area, and that the debate doesn't really seem to have seriously started yet - at least not at NTTS.

This all presents some interesting parallels with long-established positions. Business users have on many occasions told ONS that good data early is preferable to perfect data later. Does the same apply to administrative and big data? I suspect that business users will be better able to adapt to some aspects of this new environment, having a need for current data and relatively agile systems able to respond to different forms of measurement. However, the challenge of using statistics about whose provenance we are very uncertain is probably as much a problem for business as it is for government. Official statistics and the policy community more broadly are not so agile. Often for good reasons, we have come to accept the importance of being able to make historical comparisons, to conduct secondary research, and to evidence the impact of policy interventions. If we are to find sensible common ground then we are all going to have to let go of some cherished assumptions, and we certainly need to get back to fundamental questions about what we really want to measure and why. My plea to NTTS delegates and readers of this blog is to go back and think very hard about what you need to measure and why, rather than rushing to engage with each new possibility: we could very soon be overwhelmed by the possibilities!

David Martin is Professor of Geography at the University of Southampton. He has worked closely with ONS since the mid-1990s on census geography design and pioneered the creation of output areas and workplace zones. He is on the leadership team of three major ESRC Centres - the UK Data Service, National Centre for Research Methods and Administrative Data Research Centre for England.


Any views or opinions presented are solely those of the author and do not necessarily represent those of the MRS Census and Geodemographic Group unless otherwise specifically stated.
