Are anonymised data really anonymous?

Last Sunday, the Observer newspaper published a story denouncing the sale of millions of NHS patients' medical data – compiled from GP surgeries and hospitals – to American and other international drugs companies. This data is sold routinely, and for huge sums.

Medical data is of immense interest to pharmaceutical corporations, and a top priority for the US drugs industry in particular: the Trump administration has made clear that it wants unrestricted access to Britain's 55 million health records – estimated to be worth £10bn a year – as part of any post-Brexit trade agreement.

The Department of Health and Social Care, which collects this data, has repeatedly denied any wrongdoing, insisting that the data is anonymised and therefore does not violate any individual's privacy.

NHS data is increasingly sought after by researchers and global drugs companies because the NHS is one of the largest and most centralised public health organisations in the world.

NHS officials have repeatedly raised concerns that anonymised NHS data is not really anonymous, and that it is a matter of routine for buyers to sift through the data for interesting or relevant medical histories and link them back to individual patients.

The Observer quotes Phil Booth, coordinator of medConfidential, which campaigns for the privacy of health data, as saying that the public was being betrayed by claims that the information could not be linked back to individuals. “Removing or obscuring a few obvious identifiers, like someone’s name or NHS number from the data, doesn’t make their medical history anonymous,” he said. “Indeed, the unique combination of medical events that makes individuals’ health data so ripe for exploitation is precisely what makes it so identifiable. Your medical record is like a fingerprint of your whole life.”

The idea that anonymised data is confidential and secure is both naive and deceitful. In a nutshell, anonymising a dataset means masking out some of its features, such as individuals' names and addresses. But there is a whole industry – a global data marketplace operating behind the “consent” buttons that now pop up on every website – that trades in data enrichment and linkage, with the aim of connecting all our activities into a hyper-accurate picture of our individual lives. Companies like Adara and LiveRamp promise their clients the ability to link any dataset back to individual people, including datasets of offline behaviour.
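To see how linkage works in practice, here is a minimal Python sketch using pandas, with entirely invented records and column names, showing how a “de-identified” medical extract can be joined back to a commercial dataset on a handful of quasi-identifiers such as postcode, date of birth and sex. No names or NHS numbers are needed for the join.

```python
import pandas as pd

# A "de-identified" medical extract: names and NHS numbers removed,
# but quasi-identifiers (postcode, date of birth, sex) left in.
medical = pd.DataFrame({
    "postcode":  ["SW1A 1AA", "M1 2AB", "EH1 3CD"],
    "dob":       ["1984-02-11", "1990-07-30", "1975-12-02"],
    "sex":       ["F", "M", "F"],
    "diagnosis": ["type 2 diabetes", "asthma", "depression"],
})

# A commercial marketing dataset bought elsewhere, with names attached.
marketing = pd.DataFrame({
    "name":     ["Alice Smith", "Bob Jones", "Carol White"],
    "postcode": ["SW1A 1AA", "M1 2AB", "EH1 3CD"],
    "dob":      ["1984-02-11", "1990-07-30", "1975-12-02"],
    "sex":      ["F", "M", "F"],
})

# Joining on the quasi-identifiers re-attaches names to diagnoses:
# the "anonymised" extract was never really anonymous.
reidentified = medical.merge(marketing, on=["postcode", "dob", "sex"])
print(reidentified[["name", "diagnosis"]])
```

This is exactly the point Booth makes above: the more distinctive the combination of attributes in a record, the easier the join becomes.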

Regulation, such as the GDPR, is catching up, but not quickly enough. Data technology itself can help, though: not all data applications lead to a privacy-less dystopian hellscape.

Synthetic data is data that is artificially generated. In its simplest form, it is created by funnelling real-world data through a noise-adding algorithm to construct a new dataset that captures the statistical features of the original information without giving away individual records. As Anjana Ahuja puts it in the Financial Times, the usefulness of synthetic data hinges on a principle known as differential privacy: anybody mining synthetic data could make the same statistical inferences as they would from the true data, without being able to identify individual contributions.
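As a rough illustration of the noise-adding idea, the toy sketch below uses the Laplace mechanism, the textbook building block of differential privacy. It is an illustrative assumption of mine, not a description of any actual NHS or vendor pipeline: a count over the real data is only ever released with calibrated random noise added, so the aggregate stays useful while any single person's contribution is masked.

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_count(values, predicate, epsilon=1.0):
    """Release a count under epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon is enough to mask any individual's contribution.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Toy "real" data: ages of patients with a given diagnosis.
ages = [34, 45, 52, 61, 29, 47, 55, 38, 70, 66]

# An analyst still learns the aggregate picture...
print(dp_count(ages, lambda a: a >= 50))

# ...but each query returns a slightly different answer, and no answer
# reveals whether any particular person is in the data.
print(dp_count(ages, lambda a: a >= 50))
```

Smaller values of epsilon mean more noise and stronger privacy. In a full synthetic-data pipeline the same principle is applied to many statistics at once, and synthetic records are then sampled from the noisy statistics rather than copied from real people.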

It is important to note that the vast majority of data-intelligence applications do not need real-world data in order to be built successfully, and in particular they do not need individuals' data profiles. Self-driving cars can be trained on simulated roads in video games. Fraud detection algorithms do not need credit card transactions tied to named individuals, just the patterns of transactions over a large sample of people. The only applications that genuinely need individual data profiles are those that aim to influence or monitor individual behaviour, such as targeted advertising and individual surveillance.
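To make the fraud-detection point concrete, here is a small hypothetical sketch: the model consumes only behavioural features of each transaction (amount, hour of day, distance from the previous transaction), and no names or card numbers ever enter the training data. The figures and feature choices are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Made-up transaction features: [amount, hour of day, distance (km)
# from the previous transaction]. No identifiers of any kind.
X = np.array([
    [12.50,  9,   1.0],
    [30.00, 13,   2.5],
    [8.75,  18,   0.5],
    [22.10, 11,   3.0],
    [15.00, 20,   1.2],
    [980.0,  3, 400.0],   # large amount, small hours, far away
])

# The detector learns what "normal" patterns look like across many
# transactions; who made them is irrelevant to the model.
model = IsolationForest(contamination=0.2, random_state=0).fit(X)
print(model.predict(X))   # -1 flags the outlying transaction as suspicious
```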

Another interesting point is that real-world data is riddled with bias. An infamous example is algorithmic decision-making in fields such as criminal justice and credit scoring, where there is stark evidence of racial discrimination. Because algorithms are trained on real-world data, they end up amplifying and perpetuating existing imbalances and injustices. Bill Howe, a data scientist at the University of Washington, believes synthetic data could be an answer to such systemic bias: “We could modify that bias. People could release synthetic data that reflects the world we would like to have. Why not use those as training sets for AI?” In other words, synthetic data could be used to train algorithms on a fairer picture of the world than the one our real data records.
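As a crude sketch of what “releasing synthetic data that reflects the world we would like to have” could mean in practice, the example below (with made-up numbers) resamples a biased training set so that both groups end up with the same approval rate before any model sees it. Real de-biasing is considerably harder than this, but the mechanics look roughly like the following.

```python
import pandas as pd

# A tiny, made-up training set with a historical imbalance:
# group B has been approved far less often than group A.
real = pd.DataFrame({
    "group":    ["A"] * 8 + ["B"] * 8,
    "approved": [1, 1, 1, 1, 1, 1, 0, 0,    # 75% approval for group A
                 1, 0, 0, 0, 0, 0, 0, 0],   # 12.5% approval for group B
})

def synthesise_balanced(df, n_per_cell=50, seed=0):
    """Build a synthetic set in which every group has the same approval
    rate, by resampling records from each (group, outcome) cell."""
    cells = []
    for group in df["group"].unique():
        for outcome in (0, 1):
            cell = df[(df["group"] == group) & (df["approved"] == outcome)]
            cells.append(cell.sample(n=n_per_cell, replace=True, random_state=seed))
    return pd.concat(cells, ignore_index=True)

synthetic = synthesise_balanced(real)
print(real.groupby("group")["approved"].mean())        # unequal in the real data
print(synthetic.groupby("group")["approved"].mean())   # equal in the synthetic set
```

Whether the resulting decisions are actually fairer depends entirely on choices like these: “the world we would like to have” is a design decision, not a by-product of the technology.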

The world of synthetic data is evolving fast. And in a world where data is weaponised against individuals, synthetic data might just be a shield against abuse.
