On the dangers of sharing real data

There are very good reasons why a financial institution should never share its data. In fact, it should never even move its data. Ever.

First of all, the obvious: in the 21st century, almost every human action produces a digital footprint. The footprints we are leaving are out of control. Nobody can avoid their creation, nor can they avoid retention by the entity they are interacting with (e.g. the bank records of a person who checks their account balance on their mobile).

What one can ask, and what the GDPR is trying to ensure, is that we always know who holds our information and what that information is. We can ask for it back, or opt out.

Where it gets complicated is what we are going to call the data marketplace. And down the rabbit hole we go. The bank might need a third party vendor to develop some business application, such as a fraud-detection algorithm. For this, the third party vendor needs the data of millions of credit card transactions. The bank will ask users to consent to the use of their data for third-party interactions, but the user has no real way of knowing who is seeing it. The bank and the third party vendor sign a contract, which supposedly keeps the data “safe”. The data is visible to an undisclosed number of people, so we can only assume that “safe” means it’s not on fire.

The third party vendor might then aggregate the data from the user’s bank with data from other banks, or with data from other sources such as social media. At this point, who owns this data? Each additional processing step makes the regulatory grasp on the data more tenuous, and compliance more abstract. The user’s data, which they could not help but give to their bank for an activity ingrained in our way of life, is now far, far away. The user’s data is in the wild.

The ethics of the data marketplace practices are appalling, of course. But let’s focus for a moment on the bank.

The terms of agreement between the bank and its customers regarding data sharing must specify the scope and the specific business case for which the data will be shared with third parties. Once the data is out of the bank, contracts and NDAs cover the chain of custody, in theory.

In practice, and technologically speaking, the data is out in the world.

Customers may agree to share their data with one entity, for a specific use, only to find down the line that their data has been shared with other entities without their consent. There are frequent accounts of legal teams at financial institutions scrambling to update their Terms of Service to reflect the ever-expanding data-sharing activities of their innovation departments. This is a fast-moving area, and mistakes are easy to make. Financial institutions without iron-clad data processes and policies can find themselves unwittingly in violation of their own customer contracts.

And going back to those NDAs… a bank has no real technological means to ensure that third party vendors are not sharing its data with other parties. Such a violation would be extremely hard to track down in practical terms, but it has very real consequences the moment it surfaces in the press or in public awareness.

A widespread concern stemming from the GDPR is that some of its requirements are, at the moment, technologically impossible to comply with: in other words, despite a bank’s best efforts, there is no technical solution that can ensure the privacy of its users once the data leaves the bank’s servers.

This is already bad enough. But let’s not forget the darker corners of the internet, where hackers are trying to find a way into the treasure trove of financial institutions’ data. The bank itself probably has some of the most sophisticated firewalls in existence.

But sharing data means copying data, over and over, and putting it in places with ever-weaker security measures: the cloud, a third party vendor’s server, or, God forbid, Excel spreadsheets. A hacker can stop worrying about the bank’s ponderous security measures; they just need to find the third party vendors and hack them instead.

The hacker is very grateful: this data is probably model-ready, wrangled from multiple data lakes in the bank and aggregated into a friendly, clean, compact format. Jackpot.

We have already written about the illusion of anonymisation as a solution to all this.
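The illusion is easy to demonstrate with a toy linkage attack: an “anonymised” dataset, with names stripped and users replaced by dummy labels, is joined against public side data on the quasi-identifiers that survived. Everything below is hypothetical (invented column names and records), a minimal sketch rather than a real attack:

```python
# Toy linkage (re-identification) attack. All records and column
# names are invented for illustration.

# "Anonymised" transactions: direct identifiers removed, each user
# replaced by a dummy label -- but quasi-identifiers remain.
transactions = [
    {"user": "u_1", "zip": "10115", "birth_year": 1984, "merchant": "pharmacy"},
    {"user": "u_2", "zip": "10115", "birth_year": 1990, "merchant": "casino"},
    {"user": "u_3", "zip": "80331", "birth_year": 1984, "merchant": "grocery"},
]

# Public side data (e.g. scraped from social media), with real names.
public_profiles = [
    {"name": "Alice", "zip": "10115", "birth_year": 1984},
    {"name": "Bob",   "zip": "10115", "birth_year": 1990},
]

def reidentify(transactions, profiles):
    """Join the two datasets on the surviving quasi-identifiers."""
    matches = {}
    for t in transactions:
        candidates = [p["name"] for p in profiles
                      if (p["zip"], p["birth_year"]) == (t["zip"], t["birth_year"])]
        if len(candidates) == 1:  # a unique match means re-identification
            matches[t["user"]] = candidates[0]
    return matches

print(reidentify(transactions, public_profiles))
# {'u_1': 'Alice', 'u_2': 'Bob'}
```

Two columns as coarse as postcode and birth year are enough to tie dummy labels back to named individuals, which is exactly the problem with treating field removal as anonymisation.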

Removing fields from the data, or encoding them with dummy labels, is child’s play in the face of a data marketplace where aggregated datasets are for sale, and where a single user can be re-identified quickly and accurately from a myriad of data sources, related to both online and offline activities.

While we wait for a better future, one in which we value the privacy of the individual in the first place and correct our economic model as a result, there is something we can do to mitigate the dangers of private data leaks and abuses, and move the world in the right direction. It’s called synthetic data. We’ll tell you all about it soon.