Taking steps to prevent jigsaw re-identification in genomic research
Four major UK research funders have released their joint response to a statement from the Expert Advisory Group on Data Access (EAGDA), on the issue of the re-identifiability of participants from genomic research studies. Natalie Banner, Policy Officer at the Wellcome Trust, explains why this issue is becoming increasingly important.
When individuals participate in consented research studies, researchers are duty bound to protect their confidentiality as far as possible. Much genomic research relies on using large-scale aggregate datasets, which are anonymised, to establish which genes may be associated with different diseases. If any potentially identifiable data are made available for other researchers to access, there are strict legal restrictions in place stipulating how the data can and cannot be used. Such safeguards are designed to reduce the risk of participants being re-identified from research data, which would be a breach of confidentiality.
With advances in bioinformatics, an explosion in the volume of data, and an increasing push towards sharing data, there are growing possibilities for linking datasets in order to seek out and answer new research questions. To take just one example, linking data on the incidence of particular diseases, genomic markers, and socioeconomic indicators may reveal new insights into the complex relationship between aspects of health, genomics and environment.
However, linking datasets raises a challenging ethical and practical issue for participant confidentiality: the risk of jigsaw re-identification. With only one or two pieces of information, very little can be tied to a particular individual and the possibility of actually identifying a person within the mass of aggregated data is remote. But when data from multiple sources are available, they may, in certain circumstances, allow a more complete picture of an individual to be pieced together. This could result in some confidential information being linked to an identifiable person. A paper on genomics published in Science last year demonstrated how it could be technically possible to link data in this way. The authors developed a complex methodology that involved linking open access genetic sequence data, information from publicly available genealogy databases that link surnames with specific genetic markers on the Y-chromosome, and public demographic records. The team successfully triangulated the identities and genomes of up to 50 participants from a research study, the 1000 Genomes Project. Importantly, while the participants themselves had given consent for their data to be openly and freely used, linking a genome to an individual has implications for their biological relatives as well, and for future generations with whom they will share genetic characteristics.
Although the method used by the authors succeeded only in a highly specific set of circumstances, the paper alerted funders, researchers and institutions to the technical possibility that anonymised genomic data could, in principle, be subject to re-identification through linkage with other data sources. It’s important to realise that for the purposes of biomedical research, linking any research data to a name is neither necessary nor desirable: researchers want to know how different genes, diseases, environmental factors and so on relate to one another, not who has what condition. To continue the jigsaw analogy, they are much more interested in finding many pieces of the same shape or size from lots of different individuals than they are in putting together a complete picture of a single person. But if the risk of building such a picture is there, we need to mitigate it as strongly as possible.
In light of this, EAGDA conducted research last year into the issue of identifiability in UK research studies involving or linking to genomic data, seeking to establish whether there was a risk of participants being re-identified and to identify steps that funders and study leaders could take to mitigate this risk. This culminated in a series of recommendations to EAGDA’s funders: the MRC, ESRC, CRUK and the Wellcome Trust. These recommendations centre on issues of consent, risk assessment, controlling access to data and enforcing sanctions against anyone found to have deliberately attempted to re-identify individuals from research data.
It would be naïve to presume that data can ever be 100% secure: there are always going to be risks, but we believe that with good governance and management, and constant vigilance for the kinds of issues EAGDA has alerted us to, these can be managed. Given the recent public concerns over access to and the use of primary care records through the Government’s care.data scheme, it has never been more important for biomedical and health researchers to be transparent in how and why they use data, upfront with participants and their families about the risks involved, and robust in the governance systems they use to control access to research data.
At the Wellcome Trust we’re continuing to work with EAGDA and the other funders to mitigate the risks of re-identification from the data our researchers generate and analyse. Our work cannot proceed without the generous participation in research from individuals all over the UK and beyond: it is our duty to push the boundaries of medical research whilst protecting and respecting their confidentiality as far as we can.
You can read the statement and response on the EAGDA section of the Wellcome Trust website.
Image credits: (Top) Peter Artymiuk, Wellcome Images; (Lower right) Adrian Cousins, Wellcome Images