#Solo11: Drowning in data
- Open science
- Post-publication peer review
- Drowning in data
- Open data -> sooner treatments?
4) Drowning in data
Tim Hubbard from the Wellcome Trust Sanger Institute opened a fascinating panel session looking at how on earth we are dealing with the sheer volume of data science is now producing, and how that data, and our interpretation and visualisation of it, are transforming scientific research and communication.
Hubbard set the scene with a familiar fact: genome sequencing is getting cheaper and cheaper and faster and faster at an unbelievable rate. That’s a huge amount of data Sanger is producing every day. How do they cope? You need 3 things, all to do with commitment: from your researchers, from the Institute, and enough resources (i.e. commitment from funders).
At Sanger, you can’t start an experiment until you have an accession number for the database you want to deposit it in and have your metadata (the tags that help you to find stuff in a database) sorted. Why? Because unless you do these things you’re left with a bucketload of data you can’t search, submit or use – particularly if the researcher who created it has left!
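To make the "no accession number, no experiment" rule concrete, here is a minimal sketch of what such a gate might look like. It is not Sanger's actual system: the `ExperimentPlan` class, the `blocking_problems` function and the tag names in `REQUIRED_TAGS` are all hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical set of required metadata tags; a real repository defines its own.
REQUIRED_TAGS = {"organism", "sample_type", "contact"}


@dataclass
class ExperimentPlan:
    title: str
    # Identifier issued by the database the data will be deposited in.
    accession: Optional[str] = None
    metadata: Dict[str, str] = field(default_factory=dict)


def blocking_problems(plan: ExperimentPlan) -> List[str]:
    """Return the reasons this experiment may not start yet (empty list means go)."""
    problems = []
    if not plan.accession:
        problems.append("no accession number")
    # A tag counts as missing if it is absent or has an empty value.
    missing = sorted(tag for tag in REQUIRED_TAGS if not plan.metadata.get(tag))
    if missing:
        problems.append("missing metadata: " + ", ".join(missing))
    return problems


plan = ExperimentPlan(title="pilot run")
print(blocking_problems(plan))
# ['no accession number', 'missing metadata: contact, organism, sample_type']
```

The point of checking up front is that the problems surface while the researcher is still around to fix them, rather than years later when someone else tries to search the data.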
Data in repositories is useless unless you can use it, said Hubbard. Similarly, a big dataset is useless if it takes ages to load – especially if you then find it’s not of use to you.
Hubbard believes that other fields can take lessons from what biology has learned from large data handling: specifically that you want to extract and analyse, not just collect. What you want is a ‘value added database’. Curation is important: you need a society of ‘biocurators’ contributing, tweaking the parameters based on what they find. In any dataset there’s a ‘flickering’ of noise that makes it hard to work out what’s significant and what’s not. Having a curator gives you more robustness.
There’s also the open access factor that’s so important at Sanger. With Sanger’s databases, anyone with credentials as a researcher can get access. Hubbard touched on the issue of openness versus privacy – can you get the benefit of both? Privacy, he said, was going to be the biggest issue, the rate limiting step. The revolution will be virtual machines that allow you to see bits of data, but not all of it. This model would be the equivalent of an ‘iTunes for data’ (drawing an analogy to the way Apple’s store allowed music lovers to, for the first time, purchase individual songs rather than a whole album).
Hubbard was also asked if it is now cheaper to regenerate data than to store it, to which he answered “not at the moment”. He wasn’t sure that it would ever be a good idea: if you keep [data], you have a lower latency to look at it for anything else. A good example is the NHS and figuring out which drug is most appropriate for each person. You don’t want to have to get samples again each time a new drug comes on the market. So caching adds a lot of value compared with collecting data again, even if collection becomes very cheap.
There’s also the fact that some samples can’t be collected again — perhaps because of death or because your study is looking at particular timepoints. A related question was whether there might be uses for so-called ‘null’ data that initially seems worthless. The session’s Chair noted that several pharmaceutical companies might find use in going back to search for new targets or explanations for why certain compounds failed. Hubbard agreed that there are some data you just wouldn’t want to throw away, such as weather data from 20 years ago.
Hubbard also spoke briefly about the exciting possibilities for future medicine through continuous monitoring. Sanger is competing for a block of funding that would see it collect data from patients in a way never before possible, giving patients a continuous stream of monitoring data akin to what we already collect for cars or aircraft. We don’t wait until a plane’s broken down to fix it; we give it constant checks and tune-ups. This might allow us to do the same for humans.
The rest of the panel dealt with what you can do with the mounds of data – and how you can make that digestible for the public. As someone in another session said, we respond to stories, not data, and what interested me was the way that interactivity is assisting digital storytelling and the types of stories you can pick out from large datasets.
Alastair Dant, Lead Interactive Technologist at The Guardian, gave us a whistlestop tour of some of the coolest interactive data visualisations they’ve produced. I won’t go into this, as you really have to see them for yourself:
- Afghanistan: the war logs (you can play this passively, but at any time you can stop and drill down. Gives people both the passive experience and access to raw data.)
- Iraq war logs: a day in the life of war
- UK 2011 Election swingometer – basic idea is to empower the user to see the effects of any swing between 3 parties just with a simple interface (a joystick).
- Twitter traffic during World Cup 2010
- Carbon calculators (simple sliders allow people to create their own story arcs)
- Comprehensive spending review: you make the cuts (challenges readers to do better than the government)
- Hackgate: How Twitter tracked the MPs’ questions and the pie
The Guardian does interesting things when it gains access to open datasets, such as the Twitter ‘firehose’, which gives you access to every tweet in the stream (which you can then use as a big video recorder of real-time events and apply filters to), or the UK budget (which their team painstakingly transformed from the report into a spreadsheet, which they could then visualise). The whole point, said Dant, is to make data more engaging. The aim is to have an open-ended system where the user can do what they like and configure it how they like (as in the ‘fantasy budget’ interactive).
Of course, the data alone isn’t enough: there’s a strong editorial and curatorial input to help choose a passage of time that tells a story. I found it interesting how they focus in on different parts of a dataset, and sometimes ‘smooth out’ the data, to tell a story. The ethics of the latter was a point of contention, and Dant conceded that there are often clashes with ‘academic robustness’. The partnership of data and editorial context is going to be key.
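To see why ‘smoothing out’ can clash with academic robustness, here is a tiny sketch using a simple moving average. This is only an assumption about what smoothing might mean here; the panel didn’t say which method The Guardian uses, and the numbers below are invented for illustration.

```python
def moving_average(values, window=3):
    """Average each run of `window` consecutive values (simple moving average)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up daily counts with a single dramatic spike in the middle.
daily_counts = [3, 3, 30, 3, 3]

print(moving_average(daily_counts))
# [12.0, 12.0, 12.0]
```

The one-day spike disappears into a flat line. That may make for a cleaner story, but a reader who only sees the smoothed version would never know the spike was there, which is exactly the sort of editorial choice the panel was debating.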
For more information, take a look at the full, detailed Storify of the whole panel session.
Want more? Full list of blog coverage of #Solo11 is available on their wiki.